1 Introduction

Contemporary computer and communication infrastructure is exceedingly vulnerable to various types of attacks. A widespread way of carrying out these attacks is to use malware such as worms, viruses, trojans, or spyware. The sheer amount and range of known and unknown malware is part of the reason why detecting it is a complicated problem. Christodorescu and Jha [1] describe a malware instance as a program whose objective is malevolent. McGraw and Morrisett [2] describe malicious code as “any code added, changed, or removed from a software system in order to deliberately cause harm or subvert the intended function of the system.” Vasudevan and Yerraballi [3] describe malware as “a generic term that encompasses viruses, trojans, spywares and other intrusive code.”

In this work, we consider Windows-based malware for analysis. The popularity of Windows-based operating systems among average computer users is a major reason why Windows is targeted by more viruses than any other operating system. Microsoft’s apparent lack of concern for security in the early days also made the problem much worse than it had to be. In 2012, numerous attacks were devised for Windows vulnerabilities, with Windows 7’s malware infection rate climbing by as much as 182 %.

According to the Sophos security report (http://www.sophos.com/en-us/medialibrary/PDFs/other/sophossecuritythreatreport2013.pdf), much of the malware found on Macs is also Windows-based malware. Mac users who need occasional access to a Windows program sometimes decide to download it from third parties and may illegally create a license key using a downloadable generator. By doing so, they often encounter malware such as Mal/KeyGen-M, a family of trojanized license key generators that has been identified on approximately 7 % of the Macs running Sophos anti-virus software. Another common source of Windows malware on Macs today is fake Windows Media movie or TV files. These files contain auto-forwarding Web links promising the codec needed to view the video, but deliver zero-day malware instead. Moreover, the Windows partitions of dual-boot Macs can indeed be infected, as can virtualized Windows sessions running under Parallels, VMware, VirtualBox, or even the open source WINE program. Windows Media files generally do not run on Macs, but Mac users often torrent these files to improve their “ratios” on private tracker sites, without realizing that the contents are malicious. Windows users then attempt to play the videos and become infected. The top Windows malware found on Macs is shown in Fig. 1.

For detecting malware, a number of non-signature-based detection methods have been proposed in recent times. These methods typically use heuristic analysis, behavior analysis, or a combination of both to identify malware. Such methods are being actively explored because of their ability to detect zero-day malware without any a priori knowledge. Some of them have been incorporated into existing commercial off-the-shelf anti-virus products, but have attained only limited success [4, 5]. The most serious deficiency of these methods is that they are not deployable in real time. Also, McGraw and Morrisett [2] note that classifying malicious code has become more and more complex as newer versions appear to be combinations of those that belong to existing categories.

Fig. 1 Sophos 7-day snapshot of 100,000 Macs (April 2012)

2 Related Works

In this paper, we focus on a comparative study of several supervised-learning-based soft computing techniques for malware detection, using features extracted from specific parts of Win32 portable executable (PE) binaries (EXE or DLL). These are meaningful features that might indicate that the file was created or infected to perform malicious activity. The extracted features include data from the PE header that describes the physical structure of a PE binary (e.g., creation/modification time, machine type, file size), optional PE header information describing the logical structure of a PE binary (e.g., linker version, section alignment, code size, debug flags), import section details, export section details, resources used by a given file, and the version information. String features are based on plain-text strings that are encoded in program files (such as “windows,” “kernel,” “reloc”) and can possibly also be used to represent files in a way similar to a text categorization problem. The entropy of interpretable strings, which is a discriminating feature, was also considered in this work.

Schultz et al. [6] extracted DLL information from inside PE executables. They retrieved the list of DLLs, the list of DLL function calls, and the number of different function calls within each DLL. RIPPER, an inductive rule learning algorithm, was applied to each feature vector for classification. Their experiments used a dataset of 206 benign and 38 malicious executables in the PE file format.

Shafiq et al. [7] presented the “PE-Miner” framework, in which 189 structural features are extracted from executables to detect new malware in real time. They evaluated their system using the VX Heavens and Malfease datasets and obtained a detection rate of 99 %. Ye et al. [8, 9] analyzed the Windows APIs called by PE files to develop an intelligent malware detection system using object-oriented association (OOA) mining-based classification. The associations among the APIs capture the underlying semantics of the data, which are essential for malware detection. They first constructed the API execution calls and then extracted OOA rules using the OOA fast FP-growth algorithm. Finally, classification was based on the generated association rules. They tested their scheme on 29,580 executables, of which 12,214 were benign and 17,366 were malicious.

Wang et al. [10] defined the behaviors of an executable by observing its usage of DLLs and APIs. Information gain and SVMs were applied to filter out redundant behavior attributes and select the informative features for training a virus classifier. The model was evaluated on a dataset containing 1,758 benign files and 846 viruses. Shaorong et al. [11] applied associative classification based on an FP tree, using the Windows APIs called by PE files as the feature set. They proposed an incremental associative classification algorithm, mined 20,000 executables as training data, and validated the accuracy of the classifier on 10,000 executables. Sami et al. [12] extracted the API call sets used in a collection of PE files and generated a set of discriminative and domain-interpretable features. These features were then used to train a classifier to detect unseen malware, achieving a detection rate of 99.7 %. The drawback of API call feature extraction, however, is that it is very time-consuming.

3 Proposed Framework for Comparative Study

This section describes in detail the generic framework for malware detection proposed in this work.

3.1 National Software Reference Library and Reference Dataset

The National Software Reference Library (NSRL) is a project supported by the US Department of Justice’s National Institute of Justice (NIJ), federal, state, and local law enforcement, and the National Institute of Standards and Technology (NIST) to promote efficient and effective use of computer technology in the investigation of crimes involving computers. It was designed to gather software from different sources and incorporate file profiles into a reference dataset (RDS) of information. The RDS can be used by law enforcement, government, and industry organizations to review files on a computer by matching file profiles in the RDS. This helps alleviate much of the effort involved in determining which files are important as evidence on computers or file systems that have been seized as part of criminal investigations. The RDS is a collection of digital signatures of known, traceable software applications. The hash set also contains hash values of applications that may be considered malicious, e.g., steganography tools and hacking scripts.

3.2 Generic Malware Detection Framework

Figure 2 depicts the generic framework for malware detection studied and evaluated in this work.

Fig. 2 Generic malware detection framework

There are two workflows shown in Fig. 2: Path 1 (marked with *) and an auxiliary Path 2. In Path 2, which is mainly followed for forensic analysis, all the files in a system are hashed and subsequently compared with the file hashes in the RDS library. If a match is found, the file is directly classified into a known category and treated as harmless. This preprocessing reduces the overall collection of files that needs to be analyzed with a soft-computing-based classifier. Path 1 can be followed for real-time detection: executable files are evaluated directly against the classifier and categorized as malware or benign, without any hashing.
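The Path 2 pre-filter can be sketched as a hash lookup followed by a fallback to the classifier. This is a minimal illustration, not the authors' implementation: the function names `sha1_hex` and `triage`, the in-memory set of known hashes, the choice of SHA-1, and the upper-case hexadecimal form are all assumptions made for the example.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of a byte string as upper-case hex (assumed format)."""
    return hashlib.sha1(data).hexdigest().upper()


def triage(files: dict, known_hashes: set, classify) -> dict:
    """Label files whose hash is in the reference set; send the rest to a classifier.

    files        -- hypothetical mapping of file name to file contents
    known_hashes -- hypothetical set of reference hashes (e.g. loaded from the RDS)
    classify     -- any callable taking the file contents and returning a label
    """
    verdicts = {}
    for name, data in files.items():
        if sha1_hex(data) in known_hashes:
            verdicts[name] = "known"          # Path 2: matched, no further analysis
        else:
            verdicts[name] = classify(data)   # falls through to the classifier
    return verdicts
```

Only the files that miss the reference set reach `classify`, which is the reduction in workload that the pre-processing step is meant to provide.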

3.3 Three-Stage Malware and Benign File Classifier

The classification described in the generic malware detection framework is divided into a three-stage process, as shown in Fig. 3.

Fig. 3 Three-stage malware and benign file classifier

Stage 1—Feature Extraction

Table 1 shows the feature sets widely researched in the malware literature. For our study, features 2, 3, and 4 were extracted from the malware and benign files, and features 2 and 3 were finally used for training the classification algorithms.

Table 1 Widely used features in malware detection

The decision to use features 2 and 3 was based on the computation time and complexity of extracting these features, and also on the large number of Windows-based malware samples in the wild.

A total of 188 features were extracted from the Win32 portable executable (PE) header, including creation/modification time, machine type, file size, and linker version. Then, a total of 4 entropy values, as listed in Table 2, were calculated from the interpretable strings, which is a novel idea used in this work. Analysis showed that the entropy was higher in malicious files than in benign files. In total, 191 features were extracted, and a comprehensive dataset was formed. Table 2 gives an overview of the features.

Table 2 Feature description
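As an illustration of the string-entropy idea, the sketch below computes one Shannon entropy value over the printable strings in a binary. The exact definitions of the four entropy values are not given in the text, so the minimum string length of 4, the printable-ASCII definition of an "interpretable string", and the function names are assumptions for the example.

```python
import math
import re
from collections import Counter

# Runs of at least 4 printable ASCII bytes (assumed definition of a string).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data: bytes) -> list:
    """Return the printable ASCII strings embedded in a binary blob."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def string_entropy(data: bytes) -> float:
    """Entropy of all extracted strings concatenated into one text."""
    return shannon_entropy("".join(extract_strings(data)))
```

A text made of one repeated character yields 0 bits, while four equally frequent characters yield 2 bits, so higher values indicate a less predictable mix of characters.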

When analyzing the dataset, some features show discriminatory behavior, as shown in Table 3. The networking DLLs Winsock and WinINet are used largely in malware files. The COFF file header contains important information such as the type of the machine for which the file is intended, the nature of the file (DLL, EXE, or OBJ), the number of sections, and the number of symbols. It is interesting to note in Table 3 that a reasonable number of symbols are present in benign executables. The malicious executables, however, contain either too many or too few symbols. The interesting information in the standard fields of the optional header includes the linker version used to create an executable, the size of the code, the size of the initialized data, the size of the uninitialized data, and the address of the entry point.

Table 3 Discriminatory PE features

In Table 3, the values of the major linker version and the size of the initialized data differ significantly between the benign and malicious executables. The size of the initialized data in benign executables is usually significantly higher than in malicious executables. The Windows-specific fields of the optional header include information about the operating system version, the image version, the checksum, and the sizes of the stack and the heap. It can be seen that their values are significantly higher in the benign executables.
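The COFF header and standard optional-header fields discussed above sit at fixed offsets, so they can be read with a few `struct` calls. The sketch below follows the published PE/COFF layout; the function name, the dictionary keys, and the decision to read only a handful of fields are choices made for this illustration, not the feature extractor used in the study.

```python
import struct

# COFF file header: Machine, NumberOfSections, TimeDateStamp,
# PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader, Characteristics.
COFF = struct.Struct("<HHIIIHH")
# Start of the optional header: Magic, MajorLinkerVersion, MinorLinkerVersion,
# SizeOfCode, SizeOfInitializedData, SizeOfUninitializedData, AddressOfEntryPoint.
OPT_STD = struct.Struct("<HBBIIII")


def parse_pe_header(data: bytes) -> dict:
    """Extract a few header features from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew field
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    coff_at = pe_offset + 4
    machine, n_sections, timestamp, _symtab, n_symbols, opt_size, flags = \
        COFF.unpack_from(data, coff_at)
    features = {
        "machine": machine,
        "number_of_sections": n_sections,
        "time_date_stamp": timestamp,
        "number_of_symbols": n_symbols,
        "characteristics": flags,
    }
    if opt_size >= OPT_STD.size:
        _magic, major, minor, code, init, uninit, entry = \
            OPT_STD.unpack_from(data, coff_at + COFF.size)
        features.update({
            "major_linker_version": major,
            "minor_linker_version": minor,
            "size_of_code": code,
            "size_of_initialized_data": init,
            "size_of_uninitialized_data": uninit,
            "address_of_entry_point": entry,
        })
    return features
```

A truncated file raises `struct.error` rather than returning partial features; a production extractor would need to decide how to treat such malformed samples.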

Stage 2—Feature Selection

For building robust learning models with any soft-computing-based classifier, the foremost step is to have a strong and meaningful feature set. By removing the most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models. Feature selection was done using principal component analysis (PCA), information gain, and genetic search. These methods were applied to the 191 features extracted from each file, and the corresponding datasets were formed. These datasets were then used extensively with the soft-computing-based classifiers discussed below. After a few trial runs with different combinations of the feature set, 34 features were finally selected.
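Of the three selection methods, information gain is the simplest to illustrate: a feature is scored by how much knowing its value reduces the entropy of the class label. The sketch below assumes discrete feature values (continuous features such as sizes would first need binning), and the function names and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels) -> float:
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels) -> float:
    """H(labels) minus the expected entropy of labels given the feature value."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns: dict, labels, top_k: int) -> list:
    """Return the names of the top_k features ordered by information gain."""
    ranked = sorted(columns,
                    key=lambda name: information_gain(columns[name], labels),
                    reverse=True)
    return ranked[:top_k]
```

A feature that splits the two classes perfectly scores 1 bit on a balanced dataset, while a feature that is independent of the label scores 0.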

Stage 3—Classifiers

Pattern recognition and classification refer to an algorithmic procedure for assigning a given piece of input data to one of a given number of categories. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. In this work, several soft-computing-based classifiers are used, such as artificial neural networks, Bayesian networks, and decision trees, along with ensembles of classifiers with boosting.
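To make the boosting idea concrete, the following sketch implements the AdaBoost.M1 weight-update scheme for two classes (labels 0 and 1). It is an illustration only: one-feature decision stumps are swapped in as the weak learner to keep the example short, and the function names and the `rounds` parameter are assumptions.

```python
import math


def train_stump(X, y, w):
    """Return (weighted error, feature, threshold, label for value >= threshold)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for high_label in (0, 1):
                preds = [high_label if row[j] >= t else 1 - high_label for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, high_label)
    return best


def stump_predict(stump, row):
    _, j, t, high_label = stump
    return high_label if row[j] >= t else 1 - high_label


def adaboost_m1(X, y, rounds=10):
    """Train a list of (vote weight, stump) pairs with AdaBoost.M1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = stump[0]
        if err >= 0.5:                       # weak learner no better than chance
            break
        beta = max(err, 1e-10) / (1.0 - max(err, 1e-10))
        ensemble.append((math.log(1.0 / beta), stump))
        if err == 0.0:                       # perfect stump, nothing left to reweight
            break
        # Shrink the weights of correctly classified samples, then renormalize.
        w = [wi * beta if stump_predict(stump, row) == yi else wi
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps; ties go to class 1."""
    votes = [0.0, 0.0]
    for alpha, stump in ensemble:
        votes[stump_predict(stump, row)] += alpha
    return 0 if votes[0] > votes[1] else 1
```

On a one-dimensional dataset whose labels alternate in blocks, no single stump separates the classes, but three boosted stumps can, which is the kind of gain an ensemble with boosting is expected to bring.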

4 Experimental Results

A total of 66,713 malicious files were collected from VX Heavens [15], a very extensive repository of malware on the Internet. Of these, around 26,713 files had a valid PE header. After the removal of outlier data, a comprehensive dataset of 20,154 entries consisting of PE feature values and string entropy of benign and malicious files was assembled, from which 13,148 entries were randomly chosen as the training dataset and the remaining 7,006 entries were used for testing.
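The random split can be reproduced in a few lines. The sketch below is an assumption about the procedure, since the text only states that the training entries were randomly chosen; the fixed seed and the function name are added for the example so that the split is repeatable.

```python
import random


def split_dataset(entries, n_train, seed=0):
    """Shuffle a copy of the entries and split it into training and test parts."""
    rng = random.Random(seed)      # fixed seed (assumption) for repeatability
    shuffled = list(entries)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# 20,154 entries split into 13,148 for training and 7,006 for testing.
train, test = split_dataset(range(20154), 13148)
```

With these sizes, roughly 65 % of the entries are used for training and 35 % for testing.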

In particular, for the two-class classification, AdaBoostM1 + random forest gives the best classification accuracy of 99.6717 %, as shown in Table 4 and in the receiver operating characteristic (ROC) curve in Fig. 4.

Table 4 Efficiency matrix—two class
Fig. 4 ROC curve for two-class classification
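For readers who want to reproduce this kind of evaluation, the sketch below shows how accuracy and an ROC curve can be computed from classifier outputs. It is a generic illustration with hypothetical function names, not the tooling used to produce Table 4 or Fig. 4; it assumes labels of 1 for malicious and 0 for benign, and that both classes are present.

```python
def accuracy(predicted, labels) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def roc_points(scores, labels) -> list:
    """Return (false positive rate, true positive rate) points for all thresholds."""
    pos = sum(labels)
    neg = len(labels) - pos          # both pos and neg must be non-zero
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample with this score so that ties yield one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every malicious sample above every benign one has an area of 1, while scores that carry no information give an area of 0.5.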

We have also compared the performance of our best classifier with results reported in related research, as shown in Table 5. We have achieved better classification performance with fewer features.

Table 5 Comparison with related works

The efficiency matrix for nine-class classification is shown in Table 6.

Table 6 Efficiency matrix—9 class

5 Conclusion and Future Direction

In this work, a comparative study of two-class and multi-class classification for malicious executable detection was conducted and evaluated with well-corroborated experiments. A generic framework for detecting malicious executables and a three-stage classifier were also proposed. Apart from the PE header features, an entropy-based interpretable string feature was used.

As a future direction, dynamic features could be combined with static features to form a hybrid feature set for a more robust detection mechanism. A feasibility study could also be conducted to extend this work to platforms other than the Windows operating system, which is one of the most widely used and exploited operating systems.