
1 Introduction

Recently, tools for generating malware have spread rapidly on the internet, making it easy for people without expertise to create malware. As a result, the number of new malware variants is increasing quickly. According to the AV-Test malware trend report, the total volume of observed malware keeps growing while the number of genuinely new malware has changed little; most recent malware samples are variants of existing malware [1]. Most new malware uses either a polymorphic method, which compresses and encrypts existing code, or a metamorphic method, which transforms the file, in order to evade the signature pattern matching-based detection employed in traditional anti-virus solutions. Various malware analysis methods are currently being studied in response to the generation of such variants. In the present study, we propose a malware classification method using machine learning: attribute data applicable to learning is extracted from malware samples, and the resulting classification performance is analyzed.

2 Training Data Set

To carry out this study, we use the Microsoft Malware Classification Challenge data set (BIG 2015) hosted by Kaggle in 2015 [2]. Approximately 10,000 samples are provided for each of the training and test sets, and every sample is supplied in both byte and disassembly file formats. Each sample belongs to one of the classes listed in Table 1.

Table 1. BIG 2015 data set classes
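As an illustration, a minimal Python sketch for indexing the training data is shown below. It assumes the layout of the Kaggle release (a trainLabels.csv file with Id and Class columns, and paired .bytes/.asm files per sample); the directory names are illustrative and may need adjusting to a local copy of the data.

```python
# Minimal sketch for indexing the BIG 2015 training data.
# Assumes the Kaggle release layout: trainLabels.csv (Id, Class) and
# one <Id>.bytes / <Id>.asm pair per sample; paths are illustrative.
import csv
from pathlib import Path

def load_labels(csv_path="trainLabels.csv"):
    """Map each sample Id to its class label."""
    with open(csv_path, newline="") as f:
        return {row["Id"]: int(row["Class"]) for row in csv.DictReader(f)}

def sample_paths(sample_id, data_dir="train"):
    """Return the byte file and disassembly file for one sample."""
    d = Path(data_dir)
    return d / f"{sample_id}.bytes", d / f"{sample_id}.asm"
```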

3 Feature Extraction

This section discusses the feature data extracted for use in machine learning. In the present study, we extract byte sequences, API calls, section names, instructions from the assembly code, and image data as features for learning. Each feature is explained in the following subsections.

3.1 Sequences

Malware files can be classified by analyzing their byte sequences, since the samples are provided in binary format. For sequence analysis, the N-gram method is widely used: a long string is divided into multiple substrings of size N, and a statistical technique is then employed to analyze the pattern of each fragment. In this study, we set N to 4 and extract 4-gram data from the binary files of the malware samples to use for learning [3].
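A minimal sketch of this extraction is given below. It assumes the BIG 2015 .bytes layout (each line starts with an address followed by hex byte pairs, with "??" marking unreadable bytes); the file path and the number of retained 4-grams are illustrative, not the exact settings of our pipeline.

```python
# Sketch: count 4-byte sequences (4-grams) in a BIG 2015 .bytes file.
from collections import Counter

def extract_4grams(bytes_path, top_k=1000):
    """Return the top_k most frequent 4-grams of a byte file."""
    tokens = []
    with open(bytes_path) as f:
        for line in f:
            # Drop the leading address column; skip unknown "??" bytes.
            parts = line.split()[1:]
            tokens.extend(p for p in parts if p != "??")
    grams = Counter(
        " ".join(tokens[i:i + 4]) for i in range(len(tokens) - 3)
    )
    return grams.most_common(top_k)
```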

3.2 Application Programming Interface (API)

The APIs that a specific file imports can be identified by PE file analysis. API information reflects what the PE file actually does when executed; thus, it is a key indicator for malware analysis. However, many recent malware variants exploit obfuscation, so the API information that can be extracted from a file is limited. This makes it difficult to use API information alone for learning and requires its use in conjunction with other features [4, 5]. In the present study, we measure the number of API calls in the malware sample files and store it as learning data (Fig. 1).

Fig. 1. API call extraction.
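For an ordinary PE file, the imported APIs can be enumerated with the open-source pefile library, as in the minimal sketch below. Note that the BIG 2015 samples are distributed with their PE headers removed, so in practice such counts would typically be gathered from the disassembly listings instead; the function name here is illustrative.

```python
# Sketch: enumerate imported API names of a PE file with pefile.
from collections import Counter
import pefile

def count_imported_apis(pe_path):
    """Count imported API names for one PE file."""
    pe = pefile.PE(pe_path)
    counts = Counter()
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        for imp in entry.imports:
            if imp.name:
                counts[imp.name.decode()] += 1
    return counts
```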

3.3 Section Name

A PE file is typically configured with pre-defined section names, including .text, .data, .idata, .edata, .rdata, .rsrc, .tls and .reloc [6]. These sections hold the executable code, global variables, import/export (DLL) information, and resource-related data. However, section names are not mandated by the file format and can be changed freely. While normal files tend to keep the standard section names, malware often uses arbitrary section names as part of its obfuscation. Accordingly, section names are extracted as a feature for learning. In our research, we collect the section names occurring in the training data and count the number of lines belonging to each section in every file to use for learning.
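A minimal sketch of this counting step is shown below. It assumes IDA-style .asm listings, where every line is prefixed with its section name (e.g. ".text:00401000"); this is an illustrative simplification rather than the exact implementation.

```python
# Sketch: count how many disassembly lines belong to each section.
from collections import Counter

def section_line_counts(asm_path):
    """Return per-section line counts for one .asm file."""
    counts = Counter()
    with open(asm_path, errors="ignore") as f:
        for line in f:
            # IDA prefixes each line with its section, e.g. ".text:00401000".
            if ":" in line:
                counts[line.split(":", 1)[0].strip()] += 1
    return counts
```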

3.4 Instruction

A binary file is difficult to understand intuitively and is therefore hard to analyze directly. For more effective analysis, the assembly code of the file needs to be considered, since that is where instruction information can be found [7]. The instruction sequence can be analyzed with the N-gram method in the same way as a byte file. In this study, we measure the frequency of each instruction and use it as feature data.
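The sketch below illustrates one way to gather instruction (opcode) frequencies from a .asm file. The small opcode vocabulary and the one-mnemonic-per-line heuristic are illustrative assumptions, not the full set used in our experiments.

```python
# Sketch: count instruction mnemonics in an IDA-style .asm listing.
from collections import Counter

# Illustrative opcode vocabulary; extend as needed.
OPCODES = {"mov", "push", "pop", "call", "jmp", "add", "sub",
           "xor", "cmp", "lea", "ret", "test", "jz", "jnz"}

def opcode_frequencies(asm_path):
    """Return mnemonic frequency counts for one .asm file."""
    counts = Counter()
    with open(asm_path, errors="ignore") as f:
        for line in f:
            for token in line.lower().split():
                if token in OPCODES:
                    counts[token] += 1
                    break  # count at most one mnemonic per line
    return counts
```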

3.5 Image Representation

Because a malware variant, when rendered as an image, is visually similar to the existing malware it derives from, recent studies have imaged malware and then applied image-processing algorithms such as convolutional neural networks (CNNs). Figure 2 shows the method for imaging malware. Imaging is performed by treating each byte of the malware binary as an 8-bit value and mapping it to a greyscale pixel with a value between 0 and 255, as shown in Fig. 2 [8].

Fig. 2. Visualizing malware as an image.

Imaging is possible for both the binary files and the assembly files of the malware samples. In the present study, the images of the assembly files are clearer than those of the binary files, and they are therefore used as features. Figure 3 shows the image pattern of each class after imaging the malware training data. In this study, we extract 8-bit vectors from the malware binary files and store them to use as learning data.

Fig. 3. Images of the malware training data.

4 Learning and Performance Evaluation

This section discusses the results of extracting the feature data described in Sect. 3 from the malware samples and applying machine learning. In the present study, we train a random forest, an ensemble machine learning algorithm, and then assess its performance. A random forest builds multiple decision trees during training and classifies a sample by aggregating the predictions of the individual trees. In this study, 75% of the data is used as the training set for the experiment, and the remaining 25% is used as the test set to assess performance. In addition, the out-of-bag (OOB) score is measured while fitting the random forest as a further performance estimate (Fig. 4).

Fig. 4. Random forest simplified.

Table 2 shows the measured performance. After training the model on the training set, performance was measured on the test set, yielding an accuracy of 99.8%. Likewise, the OOB estimate gave an accuracy of 99.5%.

Table 2. Evaluation result

Figure 5 shows a confusion matrix. We have confirmed that most of the data are correctly classified.

Fig. 5. Confusion matrix.

5 Conclusion

Recently, many studies have been carried out to deal with the rapid emergence of malware variants. In the present study, we propose a method for classifying malware using machine learning and conduct related experiments. After performing the learning procedure on the Microsoft Malware Classification Challenge data set, we analyze its performance and achieve an accuracy of over 99%. In future research, we will develop new methods for improving the performance of malware classification. In addition, we will explore how to build a data set containing both benign and malware files in order to analyze malware-detection performance.