Comparative Study of Two- and Multi-Class-Classification-Based Detection of Malicious Executables Using Soft Computing Techniques on Exhaustive Feature Set

Sheen, Shina; Karthik, R.; Anitha, R.

doi:10.1007/978-81-322-1680-3_24


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 246)


Abstract

Detection of malware using soft computing methods has been explored extensively by malware researchers to enable fast and accurate detection of newly released malware. In this work, we conducted a comparative study of two-class and multi-class classification for detecting malicious executables using soft computing techniques on an exhaustive feature set. As part of this study, a rigorous analysis of static features extracted from benign and malicious files was carried out. A generic framework was devised for the analysis and is presented in this paper. The reference dataset (RDS) from the National Software Reference Library (NSRL) was explored as a means of filtering out benign files during analysis. Finally, experiments show that AdaBoost, when combined with algorithms such as C4.5 and random forest in two-class classification, outperforms many other soft-computing-based techniques.


1 Introduction

Contemporary computer and communication infrastructure is highly vulnerable to various types of attacks. The sheer volume and variety of known and unknown malware is part of the reason why detecting malware is a hard problem. A widespread way of mounting these attacks is through malware such as worms, viruses, trojans, or spyware. Christodorescu and Jha [1] describe a malware instance as a program whose objective is malevolent. McGraw and Morrisett [2] describe malicious code as “any code added, changed, or removed from a software system in order to deliberately cause harm or subvert the intended function of the system.” Vasudevan and Yerraballi [3] describe malware as “a generic term that encompasses viruses, trojans, spywares and other intrusive code.”

In this work, we consider Windows-based malware for analysis. The popularity of Windows-based operating systems among average computer users is a major reason why Windows has the most viruses of any operating system. Microsoft’s apparent lack of concern for security in the early days also made the problem much worse than it had to be. The year 2012 saw numerous attacks devised for Windows vulnerabilities, with Windows 7’s malware infection rate as much as 182 %.

According to the Sophos security threat report (http://www.sophos.com/en-us/medialibrary/PDFs/other/sophossecuritythreatreport2013.pdf), much of the malware found on Macs is also Windows-based. Mac users who need occasional access to a Windows program sometimes decide to download it from third parties and may illegally create a license key using a downloadable generator. By doing so, they often encounter malware such as Mal/KeyGen-M, a family of trojanized license key generators that has been identified on approximately 7 % of the Macs running Sophos anti-virus software. Another common source of Windows malware on Macs today is fake Windows Media movie or TV files. These files contain auto-forwarding Web links promising the codec needed to view the video, but deliver zero-day malware instead. Moreover, the Windows partitions of dual-boot Macs can indeed be infected, as can virtualized Windows sessions running under Parallels, VMware, VirtualBox, or even the open source WINE program. Windows Media files generally do not run on Macs, but Mac users often torrent these files to improve their “ratios” on private tracker sites, without realizing that the contents are malicious. Windows users then attempt to play the videos and become infected. The top Windows malware found on Macs is shown in Fig. 1.

A number of non-signature-based malware detection methods have been proposed in recent times. These methods typically use heuristic analysis, behavior analysis, or a combination of both to identify malware. Such methods are being actively explored because of their ability to detect zero-day malware without any a priori knowledge. Some of them have been incorporated into existing commercial off-the-shelf anti-virus products, but have attained only limited success [4, 5]. The most serious deficiency of these methods is that they are not deployable in real time. Also, McGraw and Morrisett [2] note that classifying malicious code has become more and more complex as newer versions appear to be combinations of those that belong to existing categories.

2 Related Works

In this paper, we focus on a comparative study of supervised-learning-based soft computing techniques for malware detection using features extracted from specific parts of Win32 PE binaries (EXE or DLL). These are meaningful features that might indicate that the file was created or infected to perform malicious activity. The extracted features include data from the PE header that describes the physical structure of a PE binary (e.g., creation/modification time, machine type, file size), optional PE header information describing the logical structure of a PE binary (e.g., linker version, section alignment, code size, debug flags), import section details, export section details, resources used by a given file, and the version information. String features are based on plain-text strings encoded in program files (such as “windows,” “kernel,” “reloc”) and can also be used to represent files in a manner similar to a text categorization problem. The entropy of interpretable strings, which is a discriminating feature, was also considered in this work.
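As an illustration of the string features described above, the following minimal Python sketch pulls runs of printable ASCII characters out of a byte buffer, in the spirit of the Unix strings utility. The function name `printable_strings`, the minimum run length of 4, and the sample bytes are assumptions made for this example and are not taken from the study.

```python
import re
from typing import List


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least `min_len` printable ASCII bytes found in `data`."""
    # 0x20..0x7e is the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Hypothetical byte buffer standing in for the contents of an executable.
sample = b"\x00\x01kernel32.dll\x00\xff\xfe.reloc\x00ab\x00windows\x90"
found = printable_strings(sample)
print(found)  # ['kernel32.dll', '.reloc', 'windows']
```

The run "ab" is dropped because it is shorter than the minimum length, which is the usual way of suppressing accidental matches in binary data.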

Schultz et al. [6] extracted DLL information from PE executables. They retrieved the list of DLLs, the list of DLL function calls, and the number of different function calls within each DLL. RIPPER, an inductive rule learning algorithm, was applied to each feature vector for classification. Their experiments used a dataset of 206 benign and 38 malicious executables in the PE file format.

Shafiq et al. [7] presented the “PE-Miner” framework, in which 189 structural features are extracted from executables to detect new malware in real time. They evaluated their system using the VX Heavens and Malfease datasets and obtained a detection rate of 99 %. Ye et al. [8, 9] analyzed the Windows APIs called by PE files to develop the Intelligent Malware Detection System (IMDS) using object-oriented association (OOA) mining-based classification. The associations among the APIs capture the underlying semantics of the data, which are essential for malware detection. They first constructed the API execution calls and then extracted OOA rules using the OOA fast FP-growth algorithm. Finally, classification is based on the association rules generated. They tested their scheme using 29,580 executables, of which 12,214 are benign and 17,366 are malicious.

Wang et al. [10] defined the behaviors of an executable by observing its usage of DLLs and APIs. Information gain and SVMs were applied to filter out redundant behavior attributes and select informative features for training a virus classifier. The model was evaluated on a dataset containing 1,758 benign files and 846 viruses. Shaorong et al. [11] applied associative classification based on an FP tree, using the Windows APIs called by PE files as the feature set, and proposed an incremental associative classification algorithm. They mined 20,000 executables as training data and validated the accuracy of the classifier on 10,000 executables. Sami et al. [12] extracted the API call sets used in a collection of PE files and generated a set of discriminative and domain-interpretable features. These features are then used to train a classifier to detect unseen malware. They achieved a detection rate of 99.7 %. The drawback of API call feature extraction, however, is that it is very time-consuming.

3 Proposed Framework for Comparative Study

A generic framework for malware detection is proposed and discussed in detail in this section.

3.1 National Software Reference Library and Reference Dataset

The National Software Reference Library (NSRL) is a project supported by the US Department of Justice’s National Institute of Justice (NIJ), federal, state, and local law enforcement, and the National Institute of Standards and Technology (NIST) to promote efficient and effective use of computer technology in the investigation of crimes involving computers. It was designed to gather software from different sources and incorporate file profiles into a reference dataset (RDS) of information. The RDS can be used by law enforcement, government, and industry organizations to review files on a computer by matching file profiles in the RDS. This helps alleviate much of the effort involved in determining which files are important as evidence on computers or file systems that have been seized as part of criminal investigations. The RDS is a collection of digital signatures of known, traceable software applications. The hash set also contains hash values of applications that may be considered malicious, e.g., steganography tools and hacking scripts.

3.2 Generic Malware Detection Framework

Figure 2 depicts the generic framework for malware detection studied and evaluated in this work.

Two workflows are shown in Fig. 2: Path 1 (marked with *) and an auxiliary Path 2. In Path 2, mainly followed for forensic analysis, all the files in a system are hashed and then compared with the file hashes in the RDS. If a match is found, the file is directly classified into a known category, which is harmless. This preprocessing reduces the overall collection of files that needs to be analyzed with a soft-computing-based classifier. Path 1 can be followed for real-time detection of executable files by evaluating them directly against the classifier, which categorizes them as malware or benign without any hashing.
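The hash-based filtering of Path 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the files are SHA-1 hashed (the RDS publishes SHA-1 values among others), the set `rds_hashes` is a small in-memory stand-in for the real RDS, and the file names and contents are hypothetical.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def sha1_hex(data: bytes) -> str:
    """Upper-case hexadecimal SHA-1 digest of a byte string."""
    return hashlib.sha1(data).hexdigest().upper()


def split_known_unknown(
    files: Dict[str, bytes], known_hashes: Set[str]
) -> Tuple[List[str], List[str]]:
    """Separate files whose hash appears in the reference set from the rest."""
    known, unknown = [], []
    for name, content in files.items():
        target = known if sha1_hex(content) in known_hashes else unknown
        target.append(name)
    return known, unknown


# Hypothetical inputs: two files and a one-entry stand-in for the RDS.
files = {"notepad.exe": b"benign bytes", "dropper.exe": b"unknown bytes"}
rds_hashes = {sha1_hex(b"benign bytes")}

known, unknown = split_known_unknown(files, rds_hashes)
print(known)    # ['notepad.exe']  -> filtered out as known software
print(unknown)  # ['dropper.exe']  -> passed on to the classifier
```

Only the files in `unknown` would need feature extraction and classification, which is where the reduction in workload comes from.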

3.3 Three-Stage Malware and Benign File Classifier

The classification described in the generic malware detection framework is divided into a three-stage process, as shown in Fig. 3.

Stage 1—Feature Extraction

Table 1 shows the feature sets widely researched in the malware literature. For our study, features 2, 3, and 4 of Table 1 were extracted from the malware and benign files, and features 2 and 3 were finally used for training the classification algorithm.

Table 1 Widely used features in malware detection


The decision to use features 2 and 3 was based on the computation time and complexity of extracting these features, and also on the large number of Windows-based malware in the wild.

A total of 188 features were extracted from the Win32 portable executable (PE) header, including creation/modification time, machine type, file size, and linker version. Then, a total of 4 entropy values, as listed in Table 2, were calculated from the interpretable strings, which is a novel idea used in this work. Analysis showed that the entropy was higher in malicious files than in benign files. A total of 191 features were extracted, and a comprehensive dataset was formed. Table 2 gives an overview of the features.
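The exact definitions of the four entropy values are given only in Table 2, so the sketch below shows just the standard Shannon entropy of a string over its characters, which is the usual building block for such features. The function name `shannon_entropy` and the sample strings are assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character: H = -sum(p * log2(p))."""
    if not text:
        return 0.0
    n = len(text)
    counts = Counter(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(shannon_entropy("aaaa"))  # a single repeated symbol carries no information
print(shannon_entropy("aabb"))  # two equally likely symbols: 1 bit per character
print(shannon_entropy("abcd"))  # four equally likely symbols: 2 bits per character
```

Strings that look random, for example packed or obfuscated content, score closer to the maximum for their alphabet, which is consistent with the observation that entropy tends to be higher in malicious files.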

Table 2 Feature description


Analysis of the dataset shows that some features have discriminatory behavior, as shown in Table 3. The networking DLLs Winsock and Winnet are used heavily in malware files. The COFF file header contains important information such as the type of the machine for which the file is intended, the nature of the file (DLL, EXE, or OBJ), the number of sections, and the number of symbols. It is interesting to note in Table 3 that a reasonable number of symbols are present in benign executables. The malicious executables, however, contain either too many or too few symbols. The interesting information in the standard fields of the optional header includes the linker version used to create an executable, the size of the code, the size of the initialized data, the size of the uninitialized data, and the address of the entry point.
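To make the COFF file header fields concrete, the following minimal sketch reads them with Python's standard `struct` module. It assumes the layout of the PE/COFF format: an "MZ" DOS header whose 4-byte field at offset 0x3C points to the "PE\0\0" signature, followed by the 20-byte COFF file header. The helper `make_fake_pe` builds a synthetic header for demonstration only; it is not a real executable, and all field values in it are illustrative.

```python
import struct

# COFF file header: Machine, NumberOfSections, TimeDateStamp,
# PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader, Characteristics.
COFF = struct.Struct("<HHIIIHH")
FIELD_NAMES = (
    "machine", "number_of_sections", "time_date_stamp",
    "pointer_to_symbol_table", "number_of_symbols",
    "size_of_optional_header", "characteristics",
)


def parse_coff_header(data: bytes) -> dict:
    """Locate the PE signature and decode the COFF file header that follows it."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    return dict(zip(FIELD_NAMES, COFF.unpack_from(data, pe_offset + 4)))


def make_fake_pe() -> bytes:
    """Build a synthetic DOS header + PE signature + COFF header (demo only)."""
    dos = bytearray(64)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 64)  # PE signature starts right after
    coff = COFF.pack(0x014C, 3, 0x50000000, 0, 0, 224, 0x0102)
    return bytes(dos) + b"PE\0\0" + coff


header = parse_coff_header(make_fake_pe())
print(hex(header["machine"]))        # 0x14c (Intel 386)
print(header["number_of_sections"])  # 3
print(header["number_of_symbols"])   # 0
```

Each decoded field can be used directly as one numeric entry of a feature vector; the optional header can be decoded in the same way from the bytes that follow.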

Table 3 Discriminatory PE features


In Table 3, the values of the major linker version and the size of the initialized data differ significantly between the benign and malicious executables. The size of the initialized data in benign executables is usually significantly higher than in the malicious executables. The Windows-specific fields of the optional header include information about the operating system version, the image version, the checksum, and the size of the stack and the heap. It can be seen that their values are significantly higher in the benign executables.

Stage 2—Feature Selection

For building robust learning models for any soft-computing-based classifier, the foremost step is to have a strong and meaningful feature set. By removing irrelevant and redundant features from the data, feature selection helps improve the performance of learning models. Feature selection was done using principal component analysis (PCA), information gain, and genetic search. These methods were applied to the 191 features extracted from each file, and the corresponding datasets were formed. These datasets were then used exhaustively with many soft-computing-based classifiers, which are discussed below. After a few trial runs with different combinations of the feature set, 34 features were finally selected.
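Of the three selection methods, information gain is the simplest to illustrate. The sketch below computes the information gain of a discrete feature with respect to the class label, IG = H(labels) - sum over values v of P(v) * H(labels | v). The feature names and the four-sample dataset are hypothetical, and the sketch does not cover the discretization of continuous PE features that a real pipeline would need.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy (bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels) -> float:
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: two malicious and two benign samples.
labels = ["mal", "mal", "ben", "ben"]
features = {
    "imports_network_dll": [1, 1, 0, 0],  # separates the classes perfectly
    "even_file_size": [1, 0, 1, 0],       # carries no class information
}

ranking = sorted(
    features, key=lambda name: information_gain(features[name], labels), reverse=True
)
print(ranking)  # ['imports_network_dll', 'even_file_size']
```

Keeping the top-ranked features of such a list is one straightforward way to reduce a large feature set to a smaller subset.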

Stage 3—Classifiers

Pattern recognition and classification refer to an algorithmic procedure for assigning a given piece of input data to one of a given number of categories. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. In this work, many soft-computing-based classifiers were used, such as artificial neural networks, Bayesian networks, and decision trees, along with ensembles of classifiers with boosting.
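To illustrate how boosting builds an ensemble, the following sketch implements discrete two-class AdaBoost in plain Python. It is not the configuration used in the experiments: one-level decision stumps are swapped in as the weak learner in place of C4.5 or random forest, the labels are encoded as +1 and -1, and the six-point dataset is hypothetical.

```python
import math


def best_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[j] <= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] <= thr else -pol


def adaboost_fit(X, y, rounds=3):
    """Train up to `rounds` stumps, reweighting the samples after each one."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = best_stump(X, y, w)
        err = max(stump[0], 1e-10)  # guard against division by zero
        if err >= 0.5:              # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical 1-D data that no single threshold can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
print(predictions == y)  # True
```

On this toy data the best single stump misclassifies two of the six points, while the weighted vote of three stumps classifies all six correctly, which is the effect boosting is meant to achieve.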

4 Experimental Results

A total of 66,713 malicious files were collected from VX Heavens [15], an extensive malware repository on the Internet. Of these, around 26,713 files had a valid PE header. After the removal of outliers, a comprehensive dataset of 20,154 entries consisting of PE feature values and string entropy of benign and malicious files was assembled, from which 13,148 entries were randomly chosen as the training dataset and the remaining 7,006 entries were used for testing.
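A random split with these sizes can be sketched as follows. The seed, the function name `split_indices`, and the use of index lists in place of the actual feature records are assumptions; the text does not say how the random selection was implemented.

```python
import random


def split_indices(n_total: int, n_train: int, seed: int = 0):
    """Shuffle the indices 0..n_total-1 and cut them into train and test parts."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(n_total=20154, n_train=13148)
print(len(train_idx), len(test_idx))  # 13148 7006
```

This corresponds to roughly 65 % of the entries for training and 35 % for testing.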

In particular, for two-class classification, AdaBoostM1 + random forest gives the best classification accuracy of 99.6717 %, as shown in Table 4 and in the receiver operating characteristic (ROC) curve in Fig. 4.
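For reference, accuracy and the two rates that define a point on an ROC curve can be derived from the counts of a two-class confusion matrix, as in the sketch below. The counts passed in are purely illustrative and are not results from this study.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, detection rate (TPR), and false-positive rate from raw counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "detection_rate": tp / (tp + fn),       # malicious files correctly flagged
        "false_positive_rate": fp / (fp + tn),  # benign files wrongly flagged
    }


# Illustrative counts only: 100 malicious and 100 benign test files.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
print(metrics)
```

An ROC curve plots the detection rate against the false-positive rate as the decision threshold of the classifier is varied.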

Table 4 Efficiency matrix—two class


We have also compared the performance of our best classifier with results obtained in related research, as shown in Table 5. We achieved better classification performance with fewer features.

Table 5 Comparison with related works


The efficiency matrix for nine-class classification is shown in Table 6.

Table 6 Efficiency matrix—9 class


5 Conclusion and Future Direction

In this work, a comparative study of two- and multi-class classification for malicious executable detection was conducted and evaluated experimentally. Also, a generic framework for detecting malicious executables and a three-stage classifier were proposed. Apart from the PE header features, an entropy-based interpretable string feature was used.

As a future direction, dynamic features could be combined with static features to form a hybrid feature set for a more robust detection mechanism. Also, a feasibility study could be conducted to extend this work to platforms other than the Windows operating system, which is one of the most widely used and exploited operating systems.

References

1. M. Christodorescu and S. Jha. Testing malware detectors. In Proceedings of the International Symposium on Software Testing and Analysis, July 2004.
2. G. McGraw and G. Morrisett. Attacking malicious code: A report to the infosec research council. IEEE Software, 17(5):33–44, 2000.
3. A. Vasudevan and R. Yerraballi. Spike: Engineering malware analysis tools using unobtrusive binary-instrumentation. In Proceedings of the 29th Australasian Computer Science Conference, pages 311–320, 2006.
4. F. Veldman. Heuristic Anti-Virus Technology. International Virus Bulletin Conference, pages 67–76, USA, 1993.
5. J. Munro. Antivirus Research and Detection Techniques. ExtremeTech, 2002. Available at http://www.extremetech.com/article2/0,2845,367051,00.asp.
6. M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo. Data mining methods for detection of new malicious executables. In Proceedings of the 2001 IEEE Symposium on Security and Privacy (S&P’01), pages 38–49, May 2001.
7. M. Zubair Shafiq, S. Momina Tabish, Fauzan Mirza, and Muddassar Farooq. PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime. In Proceedings of the 2009 Recent Advances in Intrusion Detection (RAID) Symposium, Springer.
8. Yanfang Ye, D. Wang, T. Li, and D. Ye. IMDS: Intelligent Malware Detection System. In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
9. Yanfang Ye, Dingding Wang, Tao Li, Dongyi Ye, and Qingshan Jiang. An intelligent PE-malware detection system based on association mining. Journal in Computer Virology, 4(4):323–334, 2008.
10. Tzu-Yen Wang, Chin-Hsiung Wu, and Chu-Cheng Hsieh. A Virus Prevention Model Based on Static Analysis and Data Mining Methods. IEEE 8th International Conference on Computer and Information Technology Workshops, 2008.
11. Feng Shaorong and Han Zhixue. An Incremental Associative Classification Algorithm Used for Malware Detection. 2nd International Conference on Future Computer and Communication (ICFCC), 2010.
12. A. Sami, B. Yadegari, H. Rahimi, N. Peiravian, S. Hashemi, and A. Hamze. Malware Detection Based on Mining API Calls. In Proceedings of the 2010 ACM Symposium on Applied Computing.
13. M. Siddiqui, M. C. Wang, and J. Lee. Detecting trojans using data mining techniques. In IMTIC, ser. Communications in Computer and Information Science, D. M. A. Hussain, A. Q. K. Rajput, B. S. Chowdhry, and Q. Gee, Eds., vol. 20. Springer, 2008, pages 400–411.
14. H. Khan, F. Mirza, and S. Khayam. Determining malicious executable distinguishing attributes and low-complexity detection. Journal in Computer Virology, pages 1–11, 2010. Available: http://dx.doi.org/10.1007/s11416-010-0140-6
15. VX Heavens, http://vx.netlux.org


Author information

Authors and Affiliations

PSG College of Technology, Coimbatore, India
Shina Sheen, R. Karthik & R. Anitha


Corresponding author

Correspondence to Shina Sheen.

Editor information

Editors and Affiliations

Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore, Tamil Nadu, India
G. Sai Sundara Krishnan
Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore, Tamil Nadu, India
R. Anitha
Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore, Tamil Nadu, India
R. S. Lekshmi
Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore, Tamil Nadu, India
M. Senthil Kumar
Department of Mathematics, Ryerson University, Toronto, Ontario, Canada
Anthony Bonato
University of Basque Country, Paseo Manuel De Lardizalbal 1, San Sebastian, Spain
Manuel Graña


Cite this paper

Sheen, S., Karthik, R., Anitha, R. (2014). Comparative Study of Two- and Multi-Class-Classification-Based Detection of Malicious Executables Using Soft Computing Techniques on Exhaustive Feature Set. In: Krishnan, G., Anitha, R., Lekshmi, R., Kumar, M., Bonato, A., Graña, M. (eds) Computational Intelligence, Cyber Security and Computational Models. Advances in Intelligent Systems and Computing, vol 246. Springer, New Delhi. https://doi.org/10.1007/978-81-322-1680-3_24


DOI: https://doi.org/10.1007/978-81-322-1680-3_24
Published: 27 November 2013
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-1679-7
Online ISBN: 978-81-322-1680-3
eBook Packages: Engineering, Engineering (R0)
