1 Introduction

In today’s digital world, modern technologies such as 5G, the Internet of Things (IoT), and artificial intelligence (AI) drive the advancement and innovation of digital society. However, privacy and security breaches pose major challenges, as cybercriminals attack users’ computers and networks to steal sensitive data, spy on the infected system, or take control of the system for their own gain [1]. Attackers use malware (malicious software) to gain access to the target system. Malware is any software, code, or program that performs malicious actions; it goes by different names depending on its behavior and purpose, such as virus, Trojan, adware, worm, and spyware. Malware analysis is used to understand the behavior of malware and supports the detection process. Currently, malware analysis is either signature-based or behavior-based, but both approaches have proven time-consuming and less effective at identifying unknown malware in real time. This paper proposes a novel architecture that combines machine learning and deep learning to detect zero-day malware effectively.

1.1 Research Background

When the very first computer virus appeared in 1988–89, antivirus software was designed to detect only known viruses by searching virus definition databases that were updated from time to time; this method is called signature-based detection. The challenge with this approach is that virus variants use different types of obfuscation, which hide the virus’s signature. Hence, signature-based methods are less efficient at detecting zero-day attacks [2]. Signature-based analysis needs domain-level knowledge to reverse engineer the malware using static and dynamic malware analysis techniques. These techniques identify the important features of the malware, which are then used for signature-based detection. Reverse engineering the malware in this way takes a long time, during which hackers might steal a great deal of valuable information. It is also a resource-intensive method.

Many researchers have observed that hackers use obfuscation methods to defeat signature-based detection. To tackle this problem, software tools are used to manually unpack the file and analyze its API calls, but this process is resource-intensive. In [3], the author presented a model that automatically extracts the API calls and analyzes the binary in four steps: (1) unpacking the malware, (2) disassembling the binary, (3) extracting the API calls, and (4) mapping the API calls and analyzing the features. This work was further enhanced in [4] by adding an extra step using machine learning: a support vector machine (SVM) is used with n-gram feature extraction from both goodware and malware binaries, with tenfold cross-validation. In [5], the author proposed a hybrid model that combines an SVM and a maximum-relevance minimum-redundancy filter (MRMRF) with API call features for enhanced malware detection. With the increase in malware variants due to obfuscation, many researchers have recently been working to improve malware detection methods [6]. This forms the motivation of this research.
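To make the n-gram feature extraction mentioned for [4] more concrete, the following is a minimal sketch that turns API call sequences into count vectors. It is not the implementation from the cited work: the function names, the choice of n = 2, and the two example call sequences are assumptions made for illustration, and the SVM training and tenfold cross-validation steps are not shown.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def api_ngrams(calls: Sequence[str], n: int = 2) -> Counter:
    """Count the contiguous n-grams in one sample's API call sequence."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


def build_vocabulary(samples: Sequence[Sequence[str]], n: int = 2) -> List[Tuple[str, ...]]:
    """Collect every n-gram seen in any sample, in a stable sorted order."""
    vocab = set()
    for calls in samples:
        vocab.update(api_ngrams(calls, n))
    return sorted(vocab)


def vectorize(calls: Sequence[str], vocab: Sequence[Tuple[str, ...]], n: int = 2) -> List[int]:
    """Represent one sample as n-gram counts aligned with the vocabulary."""
    counts = api_ngrams(calls, n)
    return [counts[gram] for gram in vocab]


# Hypothetical API call sequences, invented for this example.
goodware_calls = ["CreateFileW", "ReadFile", "CloseHandle"]
malware_calls = [
    "VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread",
    "VirtualAlloc", "WriteProcessMemory",
]

vocab = build_vocabulary([goodware_calls, malware_calls])
X = [vectorize(goodware_calls, vocab), vectorize(malware_calls, vocab)]
y = [0, 1]  # 0 = goodware, 1 = malware
```

The resulting rows of X, together with the labels y, have the shape that a classifier such as an SVM expects as input.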

2 Related Work

Machine learning algorithms work on feature engineering, feature selection, and feature representation. The sets of features of the different classes are used to train the model so that it learns a decision boundary (a plane) separating goodware from malware. This plane is then used to classify samples as malware or goodware. Both feature selection and feature engineering require domain-level knowledge. Various features can be obtained by the static and dynamic analysis explained in Sect. 3 of this paper.

The problem with classical machine learning-based malware detection systems is that they rely on feature engineering, feature learning, and feature representation [7,8,9]; once an attacker has knowledge of the features used in the model, the malware detector can be easily bypassed [10].

To be accurate, machine learning algorithms require a variety of data. Very little data for malware analysis is publicly available because of privacy and security concerns, and each available dataset has its own limitations. Many researchers therefore prepare their own datasets, and preparing a dataset for research using the data science techniques explained in [11] is a daunting task. These are the major limitations for developing a machine learning-based malware detection system that can be used in real time.

Nowadays, deep learning models, an improved form of neural networks, perform better than classical machine learning models in many tasks in fields such as natural language processing and robotics [12]. In the training phase, a deep learning model tries to capture high-level representations of features in its hidden layers, with the capability to learn from its mistakes. A few research studies that apply deep learning models to malware analysis are [7,8,9, 13,14,15,16,17,18,19,20].

3 Methods for Malware Analysis

3.1 Static Analysis

In static analysis, executables are analyzed without actually executing them. It is the first and least risky step, and it does not require a safe environment or sandbox. Static analysis examines the internal structure of the program and involves several steps: (a) Determining the file type of the malware: this helps in identifying the malware’s target operating system and architecture. (b) Fingerprinting the malware: fingerprinting means generating a hash value from the file content. It helps in determining whether this particular malware has been identified before by searching multi-antivirus databases such as VirusTotal. (c) Extracting strings: printable strings can be extracted from the executable using the strings utility available on Linux systems. Extracted strings can give clues about the program functionality and indicators associated with a suspect binary. (d) Determining file obfuscation: obfuscation is a method used by malware authors to hide the inner workings of the binary. Packers and crypters are obfuscation tools used by malware authors.
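The four static analysis steps above can be sketched with the Python standard library alone. This is an illustrative sketch rather than the tooling used in this work: the small magic-number table, the use of SHA-256 as the fingerprint, the minimum string length of 4, and the use of byte entropy as a hint of packing or encryption are all assumptions made for the example.

```python
import hashlib
import math
import re
from collections import Counter

# A few well-known magic numbers; real file-type tools check many more.
MAGIC = {
    b"MZ": "Windows PE/DOS executable",
    b"\x7fELF": "ELF executable",
    b"%PDF": "PDF document",
}


def file_type(data: bytes) -> str:
    """(a) Guess the file type from the leading magic bytes."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return "unknown"


def fingerprint(data: bytes) -> str:
    """(b) Hash the file content so it can be looked up in malware databases."""
    return hashlib.sha256(data).hexdigest()


def extract_strings(data: bytes, min_len: int = 4) -> list:
    """(c) Return runs of printable ASCII, similar to the `strings` utility."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def shannon_entropy(data: bytes) -> float:
    """(d) Byte entropy in bits (0 to 8); values near 8 may hint at packing."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


# Hypothetical byte string standing in for a sample read from disk.
sample = b"MZ\x90\x00" + b"\x00" * 10 + b"kernel32.dll\x00LoadLibraryA\x00"
report = {
    "type": file_type(sample),
    "sha256": fingerprint(sample),
    "strings": extract_strings(sample),
    "entropy": shannon_entropy(sample),
}
```

High entropy alone does not prove obfuscation, since compressed resources also raise it, so the entropy value is best read as one indicator among several.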

3.2 Dynamic Analysis

Dynamic analysis is the process of extracting information from malware while it is running. Unlike the limited view that static analysis gives of the malware being analyzed, dynamic analysis offers a more in-depth view into the malware’s functions because it collects information while the malware is executing its functions and commands. To conduct dynamic malware analysis, two things are required: a malware test environment and dynamic analysis tools.

A malware test environment is a system where malware is executed for the purpose of analysis. It should consist of an operating system that the malware is written for and should have most, if not all, of the dependencies the malware needs to execute properly.

Dynamic analysis tools, also called system monitoring tools, monitor the malware test environment for any changes made by the malware to the target system. Some of the changes that are monitored and recorded include changes to the file system, modifications to configuration files, and any other relevant changes that are triggered by the malware’s execution. Dynamic analysis tools also monitor inbound and outbound network communications and any operating system resources used by the malware. With these tools, the analyst can understand what the malware is trying to do to the target system.
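As one illustration of how file system changes can be recorded, the sketch below compares a snapshot of a directory taken before execution with one taken afterwards. It is a simplified stand-in for a real monitoring tool: the function names are hypothetical, the "execution" is simulated by ordinary file operations in a temporary directory, and network or registry monitoring is not covered.

```python
import hashlib
import os
import tempfile


def snapshot(root: str) -> dict:
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                state[os.path.relpath(path, root)] = hashlib.sha256(fh.read()).hexdigest()
    return state


def diff_snapshots(before: dict, after: dict) -> dict:
    """Report files that were created, deleted, or modified between snapshots."""
    common = before.keys() & after.keys()
    return {
        "created": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "modified": sorted(p for p in common if before[p] != after[p]),
    }


def demo() -> dict:
    """Simulate a sample that edits a config file, drops a file, and deletes one."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "config.ini"), "w") as fh:
            fh.write("mode=safe\n")
        with open(os.path.join(root, "old.txt"), "w") as fh:
            fh.write("x")
        before = snapshot(root)

        # Stand-in for running the sample inside the test environment.
        with open(os.path.join(root, "config.ini"), "w") as fh:
            fh.write("mode=unsafe\n")
        with open(os.path.join(root, "dropped.bin"), "wb") as fh:
            fh.write(b"\x00\x01")
        os.remove(os.path.join(root, "old.txt"))

        after = snapshot(root)
    return diff_snapshots(before, after)
```

A before/after comparison only shows the final state, so files that are created and removed again during execution would be missed; this is one reason sandboxes rely on continuous monitoring instead.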

A fully implemented malware test environment with the appropriate dynamic analysis tools is also known as a malware sandbox. A malware sandbox is a place where an analyst can run and observe a malware’s behavior. A malware sandbox can be a single system or a network of systems designed solely to analyze malware during runtime.

4 PE File Format

The Windows PE file is the executable file type of Windows operating systems, beginning with Windows NT and Windows 95. It is called Portable Executable because Microsoft’s vision was to use the same file format in future flavors of Windows, making the PE file common to all Windows platforms regardless of what central processing unit (CPU) they support.

The Windows PE file format is derived from the Common Object File Format (COFF) that was used in Virtual Address eXtension (VAX) systems running the Virtual Memory System (VMS) operating system developed by Digital Equipment Corporation (DEC), which was acquired by Compaq in 1998; Compaq in turn merged with HP in 2002. Most of the original Windows NT development team came from DEC. The PE file format (Fig. 1) consists of the following:

  • DOS MZ Header

  • DOS Stub

  • PE Header

  • Section Table

  • Sections

Fig. 1 PE file format
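The layout listed above can be walked with a few struct reads. The following sketch checks the DOS MZ header, follows the e_lfanew field at offset 0x3C to the PE signature, reads the 20-byte COFF file header, and then reads the 40-byte section table entries. The helper build_minimal_pe is hypothetical and only produces header bytes for the demonstration; it is not a loadable executable.

```python
import struct


def parse_pe_headers(data: bytes) -> dict:
    """Read the DOS header, PE signature, COFF header, and section table."""
    if len(data) < 64 or data[:2] != b"MZ":
        raise ValueError("missing DOS MZ header")
    # e_lfanew (offset 0x3C) holds the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # COFF file header: Machine, NumberOfSections, TimeDateStamp,
    # PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader, Characteristics.
    machine, n_sections, _ts, _symptr, _nsyms, opt_size, _chars = struct.unpack_from(
        "<HHIIIHH", data, e_lfanew + 4
    )
    table = e_lfanew + 4 + 20 + opt_size  # section table follows the optional header
    sections = []
    for i in range(n_sections):
        name, vsize, vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<8sIIII", data, table + 40 * i
        )
        sections.append({
            "name": name.rstrip(b"\x00").decode("ascii", "replace"),
            "virtual_size": vsize,
            "virtual_address": vaddr,
            "raw_size": raw_size,
            "raw_pointer": raw_ptr,
        })
    return {"machine": machine, "dos_stub": data[64:e_lfanew], "sections": sections}


def build_minimal_pe() -> bytes:
    """Hypothetical header-only byte string used to exercise the parser."""
    dos_header = b"MZ" + b"\x00" * 58 + struct.pack("<I", 0x80)  # 64 bytes
    dos_stub = b"\x00" * (0x80 - 64)
    coff = struct.pack("<HHIIIHH", 0x14C, 1, 0, 0, 0, 0, 0)  # x86, one section
    section = struct.pack(
        "<8sIIIIIIHHI", b".text", 0x1000, 0x1000, 0x200, 0x200, 0, 0, 0, 0, 0x60000020
    )
    return dos_header + dos_stub + b"PE\x00\x00" + coff + section


info = parse_pe_headers(build_minimal_pe())
```

Fields such as the section names and sizes read here are the kind of information that section-header features are built from.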

5 Dataset Description

The dataset is obtained from the publicly available datasets on IEEE Dataport. It contains information on around 48K malware and goodware samples. The dataset was built from the publicly accessible reports of a malware analysis service, a free online service that performs static and dynamic analysis on submitted files using the Cuckoo sandbox and makes the results available as an HTML report.

To ensure the credibility of the dataset, we turned to two other online archives, the National Software Reference Library (NSRL) and VirusShare.com, which provide metadata (for example, MD5 hashes) regarding known goodware and malware samples, respectively. As NSRL contains a collection of digital signatures of known, traceable software applications, if a sample is present in this collection, we are more confident that it is indeed goodware. Conversely, VirusShare.com is a repository of malware samples, hence a sample present in this repository gives us higher confidence that it is malware.
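The cross-check described above amounts to a set membership test on sample hashes. The sketch below shows one way it could be expressed; the function name, the three outcome strings, and the placeholder hash values are assumptions for illustration and do not describe the exact procedure used to validate the dataset.

```python
from typing import Set


def verify_label(md5: str, label: str, nsrl: Set[str], virusshare: Set[str]) -> str:
    """Compare a sample's claimed label with the two reference hash sets.

    Returns "confirmed" when the matching archive lists the hash, "conflict"
    when the archives contradict the label or each other, and "unverified"
    when neither archive knows the hash.
    """
    if label not in ("goodware", "malware"):
        raise ValueError("label must be 'goodware' or 'malware'")
    key = md5.strip().lower()
    in_nsrl, in_virusshare = key in nsrl, key in virusshare
    if in_nsrl and in_virusshare:
        return "conflict"
    if label == "goodware" and in_nsrl:
        return "confirmed"
    if label == "malware" and in_virusshare:
        return "confirmed"
    if in_nsrl or in_virusshare:
        return "conflict"
    return "unverified"


# Placeholder hash values, invented for this example.
nsrl_hashes = {"aaaa0000", "aaaa0001"}
virusshare_hashes = {"bbbb0000"}

status = verify_label("AAAA0000", "goodware", nsrl_hashes, virusshare_hashes)
```

Normalizing the hash to lowercase before the lookup avoids false mismatches when the archives use different letter cases.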

Once the data authenticity was confirmed, we began the extraction process, in which online information is saved locally or in a central database for further analysis. This method is called scraping, and it was done using a Python scraping library. For the NSRL repository, data was provided in textual format, which led us to use Pandas, a Python data analysis library, to extract and analyze the information. The data extracted from the PE samples are organized into three different datasets:

  1. PE_Import contains information on the top 1000 imported functions extracted from the import section of the PE sample. It has 1002 columns, of which 1000 columns are the features, one column is for the hash, and one column is for the label.

  2. PE_Section_Header contains the section header information of the .text, .data, .code, and code sections of the PE sample. It has 6 columns, of which 4 columns are the features, one column is for the hash, and one column is for the label.

  3. PE_Raw_Image contains the raw PE byte stream of the PE sample re-scaled to a 32 \(\times\) 32 grayscale image. It has 1026 columns, of which 1024 columns are the pixel values, one column is for the hash, and one column is for the label.
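To illustrate the PE_Raw_Image layout, the sketch below maps a byte stream of any length to 32 × 32 = 1024 grayscale pixel values and assembles one 1026-column row (hash, pixels, label). The re-scaling method of the original dataset is not specified here, so averaging equal-width buckets of bytes is an assumption, as are the function names and the placeholder hash.

```python
from typing import List


def bytes_to_pixels(data: bytes, side: int = 32) -> List[int]:
    """Re-scale a byte stream to side*side grayscale values (0 to 255).

    Assumption: each pixel is the integer mean of one equal-width bucket of
    bytes; inputs shorter than side*side repeat bytes (nearest neighbor).
    """
    n = side * side
    if not data:
        return [0] * n
    pixels = []
    for i in range(n):
        start = i * len(data) // n
        end = max(start + 1, (i + 1) * len(data) // n)
        chunk = data[start:end]
        pixels.append(sum(chunk) // len(chunk))
    return pixels


def raw_image_row(sample_hash: str, data: bytes, label: int) -> list:
    """Build one row: hash column, 1024 pixel columns, label column."""
    return [sample_hash] + bytes_to_pixels(data) + [label]


# Hypothetical 2048-byte stream: the first half is dark, the second half bright.
stream = bytes([10] * 1024 + [200] * 1024)
row = raw_image_row("placeholder-hash", stream, 1)
```

With this layout, the 1024 pixel columns can be reshaped to 32 × 32 before being passed to a convolutional network.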

6 Model Implementation

The proposed model, which is based on static analysis, combines machine learning and deep learning as shown in Fig. 2. It uses deep learning for the feature extraction process and a classical machine learning model for the classification process. For the PE_Section_Header and PE_Import datasets, we used fully connected artificial neural networks, and for the PE_Raw_Image dataset, we used a convolutional neural network.

For the fully connected neural networks, we used the Adam optimizer, binary cross-entropy as the loss function, and ReLU as the activation function.
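The three ingredients named above can be written out in a few lines of plain Python. This is a didactic sketch, not the network used in this work: the sigmoid output unit, the default Adam hyperparameters, and the four-point toy dataset are assumptions, and the demonstration trains a single weight rather than a full network.

```python
import math
from typing import List, Sequence, Tuple


def relu(x: float) -> float:
    """ReLU activation: pass positive values, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (assumed output activation)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_cross_entropy(y_true: Sequence[int], y_prob: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def adam_step(param: float, grad: float, m: float, v: float, t: int,
              lr: float = 0.001, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[float, float, float]:
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def train_single_weight(steps: int = 200, lr: float = 0.1) -> Tuple[float, List[float]]:
    """Fit p = sigmoid(w * x) on a toy dataset and record the loss per step."""
    xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
    w, m, v = 0.0, 0.0, 0.0
    losses = []
    for t in range(1, steps + 1):
        probs = [sigmoid(w * x) for x in xs]
        losses.append(binary_cross_entropy(ys, probs))
        grad = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / len(xs)
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return w, losses
```

In a deep learning framework these pieces are normally selected by name when the model is configured; writing them out only makes explicit what is being optimized.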

Fig. 2 Proposed model

Table 1 Features analysis result

6.1 Performance Evaluation

We performed various experiments that varied the number of features taken from each dataset in order to find the optimal number of features for our model. We initially used 50 features out of 1000 from PE_Import, 50 features out of 1024 from PE_Raw_Image, and 4 features from PE_Section_Header. For the selection process, we tuned the second-last layer of the neural network as required and stored the intermediate values it produces in a new .csv file, which gives us more informative and efficient datasets. These new datasets are then given to machine learning models for the classification process. The experimental results are shown in Table 1.
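The idea of reading out the second-last layer and storing its values as a new dataset can be sketched as follows. The network here is a stand-in with randomly initialized, untrained weights and tiny layer sizes (8 inputs, a penultimate layer of width 3), all of which are assumptions for the example; in the experiments above the penultimate width would correspond to the chosen number of features, such as 50.

```python
import csv
import io
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def random_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Hypothetical stand-in for a trained dense layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense_relu(x: Sequence[float], layer: Layer) -> List[float]:
    """Apply one fully connected layer followed by ReLU."""
    weights, biases = layer
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def penultimate_features(x: Sequence[float], layers: Sequence[Layer]) -> List[float]:
    """Run x through every layer except the final output layer."""
    out = list(x)
    for layer in layers[:-1]:
        out = dense_relu(out, layer)
    return out


def export_features(samples, layers: Sequence[Layer], out) -> None:
    """Write hash, penultimate-layer activations, and label as CSV rows."""
    width = len(layers[-2][1])
    writer = csv.writer(out)
    writer.writerow(["hash"] + [f"f{i}" for i in range(width)] + ["label"])
    for sample_hash, x, label in samples:
        writer.writerow([sample_hash] + penultimate_features(x, layers) + [label])


rng = random.Random(0)
layers = [random_layer(8, 6, rng), random_layer(6, 3, rng), random_layer(3, 1, rng)]
# Hypothetical samples: (hash, input features, label).
samples = [
    ("hash-a", [rng.random() for _ in range(8)], 0),
    ("hash-b", [rng.random() for _ in range(8)], 1),
]
buffer = io.StringIO()
export_features(samples, layers, buffer)
csv_text = buffer.getvalue()
```

The rows written this way form the reduced dataset that a classical classifier can then be trained on.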

We obtained the best result when we selected 100 features from the Import dataset, 100 from the Image dataset, and 4 from the Section dataset. We also ran our experiment using various machine learning algorithms and deep learning models and compared them with our model; the comparison can be seen in Table 2.

Table 2 Experiment result

7 Conclusion

In this work, we analyzed how ML and DL techniques fit into the scope of malware detection and how the chosen dataset could influence the results of the classifier. We analyzed, trained, and validated multiple models to better understand how laboratory conditions vary from real-world conditions. We compared our model with other models based on machine learning, deep learning, and combinations of both; in this comparison, our model achieved results as high as 98.91%.

We also conclude that a model combining both ML and DL gives better and more promising results. Having a solid knowledge of the effects of temporal consistency in the task of malware detection, we improved our base model for better results. This was done by using the DL feature extraction approach to extract information regarding malware classes and by adding more features to the model.

The task we set ourselves was not without its difficulties, but all in all, we believe our work shows that malware detection via machine learning and deep learning is feasible, not only theoretically, as related work has shown, but also with practical implications.

8 Future Work

In this work, we used static analysis for feature analysis and selection. In the future, we want to incorporate dynamic analysis into the feature engineering; doing so would give us a new perspective on the datasets and yield new features that could increase the accuracy of the model.