1 Introduction

In recent years, with the rapid development of the Internet, software has come to exert an ever greater influence on daily life. However, while the spread of modern technology has made people's life and work more convenient, it has also caused increasingly complex security problems [1]. Modular software development gives rise to a large number of security vulnerabilities. These vulnerabilities threaten personal and property security and may cause serious consequences such as information disclosure [4]. Even experienced programmers cannot guarantee absolute security during software development. Vulnerability mining requires many aspects to be considered and studied, so how to effectively extract vulnerability feature information is a topic worthy of in-depth research.

Compared with other technologies, deep learning performs better in vulnerability mining [6]. Although the application of deep learning to vulnerability mining has produced a number of representative results, the field is not yet mature. Existing research has analyzed the application of deep learning to security vulnerability mining in depth. By summarizing existing results on deep learning-based software vulnerability mining [3], we find that, owing to the wide variety of vulnerabilities, there is still a long way to go before vulnerability mining becomes automatic and intelligent [8]. It is therefore imperative to optimize the design of software vulnerability detection methods. Applying a vulnerability mining model based on mixed representation to the design of these methods aims to fundamentally improve the efficiency of software vulnerability detection.

The rest of this paper is organized as follows. Section 2 briefly introduces the general framework of deep learning applied to vulnerability mining. Section 3 reviews and analyzes existing feature extraction methods. Section 4 discusses the above work and its prospects, and Section 5 concludes the paper. The main contributions of this paper are as follows: we summarize the general framework of deep learning applied to vulnerability mining, classify and describe feature extraction methods, and look ahead to future directions for building model frameworks.

2 Relevant Knowledge

In this section, we briefly introduce the general framework of deep learning-based vulnerability mining. The framework consists of three phases: data collecting, the learning stage and the detection phase [2]. It is illustrated in Fig. 1. We describe the three phases in detail below.

Fig. 1. The framework of deep learning-based vulnerability mining.

2.1 Data Collecting

A good deep learning model needs a large number of training samples. When the collected training data is insufficient, the resulting model tends to overfit and is not suitable for validating other data samples [24]. In existing work, the samples collected for different application scenarios and learning tasks include binary programs, PDF files, C/C++ source code, IoT, etc. The methods used to collect these data vary, for example fuzz test generation. For many file formats such as DOC, PDF and SWF, a common approach is to obtain the test input set with a web crawler [14]. At this stage, we need to collect a large amount of vulnerability data, most of which comes from major open source websites [21].

2.2 Learning Stage

Most of the collected data have problems of one kind or another and cannot be used directly. The data must therefore be processed and expressed as vector input to ensure effective vulnerability detection. Generally speaking, the learning stage consists of three parts: data pre-processing, data representation and model learning. Each is described below.

Data Pre-processing. Data pre-processing refers to the processing applied to data before the main processing. Data collected in the real world are generally incomplete or inconsistent [13]; they cannot be used directly and easily lead to unsatisfactory mining results. Pre-processing can be divided into three parts: data cleaning, data integration and data reduction [15].

Data cleaning “cleans” the data by handling missing values, smoothing noisy data and correcting erroneous data. Data integration [24] refers to combining and storing data from multiple data sources to establish a data warehouse. Data reduction [18] uses data reduction techniques to obtain a reduced representation of the data set.
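The three pre-processing steps can be illustrated with a minimal Python sketch. The record layout (a dictionary with a `code` string and a 0/1 `label`), the function names and the sample records are all assumptions made for illustration; they are not taken from any of the surveyed works, and real pipelines use far richer cleaning and reduction rules.

```python
from collections import Counter

def clean_samples(samples):
    """Data cleaning: drop records whose code or label is missing or invalid."""
    return [s for s in samples if s.get("code") and s.get("label") in (0, 1)]

def integrate_sources(*sources):
    """Data integration: merge several sources, keeping one copy of identical code."""
    seen, merged = set(), []
    for source in sources:
        for sample in source:
            key = sample["code"].strip()
            if key not in seen:
                seen.add(key)
                merged.append(sample)
    return merged

def reduce_samples(samples, max_per_label):
    """Data reduction: keep at most max_per_label samples for each label."""
    counts, kept = Counter(), []
    for sample in samples:
        if counts[sample["label"]] < max_per_label:
            counts[sample["label"]] += 1
            kept.append(sample)
    return kept

# Hypothetical records from two sources (label 1 = vulnerable, 0 = not vulnerable).
source_a = [
    {"code": "strcpy(buf, src);", "label": 1},
    {"code": None, "label": 0},                      # missing code
    {"code": "strncpy(buf, src, n);", "label": 0},
]
source_b = [
    {"code": "strcpy(buf, src);", "label": 1},       # duplicate of source_a
    {"code": "gets(line);", "label": 1},
    {"code": "fgets(line, n, stdin);"},              # missing label
]

merged = integrate_sources(clean_samples(source_a), clean_samples(source_b))
reduced = reduce_samples(merged, max_per_label=1)
```

Cleaning is applied to each source before integration so that the merge step never sees incomplete records.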

Data Representation. Security researchers investigate many aspects of security vulnerabilities and use different methods to build models, but the diversity of the data makes the extraction process difficult. In this paper, we divide existing data representation methods into five categories. We review these five methods and briefly summarize their advantages and disadvantages in Sect. 3. This is the key issue discussed in this paper.

Model Learning. A good learning model is essential to vulnerability mining. Reviewing the existing literature on deep learning-based software vulnerability mining, we find that most works propose new vulnerability mining models by improving the data representation. They mainly focus on classification based on deep feature representations.

2.3 Detection Phase

The detection phase proceeds similarly to data pre-processing and data representation. First, we abstract the extracted data representation module and determine the key points of software vulnerability detection. Then we extract these key points from the software and process the software vulnerability detection data. We also eliminate detection data irrelevant to these key points and learn a model that expresses the characteristics of software vulnerabilities quantitatively. Finally, the results are fed in to obtain the vulnerability mining model. In this way, software vulnerability detection based on the hybrid deep learning model is realized.

3 Data Representation Methods

In this section, we analyze data representation methods in detail: the sequence characterization-based method, the abstract syntax tree-based method, the graph-based method, the text-based method and the mixed characterization-based method. Moreover, we divide these methods into two categories: single feature extraction methods and mixed feature extraction methods.

Fig. 2. Five data representation methods for deep learning-based vulnerability mining.

3.1 Single Feature Extraction Model

Work [28] describes the process of distinguishing outliers during data preprocessing. However, researchers found that the resulting data still have defects, so more accurate feature extraction methods are needed for more accurate classification.

Sequence Characterization-Based Method. This method is mostly based on lexical analysis of source code or binary files. In work [23], library API function calls are divided into forward and backward calls, each generating one or more slices. However, only part of the code can be detected, and the exact location of a vulnerability cannot be determined. In work [22], a deep neural network with rectified linear units is trained using stochastic gradient descent and batch normalization to predict vulnerable software components. In addition, a statistical feature selection algorithm is used to reduce the feature and search space. The evaluation results show that the proposed technique predicts vulnerable classes with high precision and recall and has good vulnerability detection ability.
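As a rough illustration of a sequence representation, the following Python sketch lexes a code fragment into tokens and encodes them as a fixed-length integer sequence, the usual input form for a sequence model. The regular expression, the reserved ids 0 (padding) and 1 (unknown token) and the function names are assumptions chosen for this example and do not reproduce the slicing procedure of work [23].

```python
import re

# Simplified lexer: identifiers, integer literals, a few two-character
# operators, then any other single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||[^\sA-Za-z_0-9]")

def tokenize(code):
    """Split a code fragment into a flat list of lexical tokens."""
    return TOKEN_RE.findall(code)

def build_vocab(token_lists):
    """Assign an integer id to every token seen in the training fragments."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab, length):
    """Map tokens to ids, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens][:length]
    return ids + [vocab["<pad>"]] * (length - len(ids))

tokens = tokenize("strcpy(buf, src);")
# tokens == ['strcpy', '(', 'buf', ',', 'src', ')', ';']
vocab = build_vocab([tokens])
sequence = encode(tokens, vocab, length=10)
# sequence == [2, 3, 4, 5, 6, 7, 8, 0, 0, 0]
```

Tokens absent from the vocabulary, such as `gets` here, map to the unknown id, which is one reason purely lexical representations lose information on unseen code.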

Abstract Syntax Tree (AST) Method. An AST is a tree representation of the abstract syntactic structure of source code, produced during program compilation. A Novel Neural Source Code Representation based on Abstract Syntax Tree (ASTNN) is proposed in work [9]. Unlike existing approaches, ASTNN divides the whole AST into a series of small syntax trees. The data in each small syntax tree is vectorized, and an RNN model is then used to generate the vector representation of the code.
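The idea of dividing an AST into small statement-level trees can be sketched with Python's standard `ast` module. This is only an illustration under stated assumptions: Python source is used as a stand-in for the languages studied in the surveyed works, the helper names are hypothetical, and the sketch does not reproduce the ASTNN implementation of work [9], whose encoder and RNN stages are omitted.

```python
import ast

def preorder(node):
    """Yield the node and all of its descendants in preorder."""
    yield node
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)

def statement_trees(source):
    """Split the AST of `source` into its statement nodes, in preorder."""
    return [n for n in preorder(ast.parse(source)) if isinstance(n, ast.stmt)]

def node_types(statement):
    """Node type names of one statement tree, without its nested statements."""
    names = [type(statement).__name__]
    for child in ast.iter_child_nodes(statement):
        if not isinstance(child, ast.stmt):
            names.extend(node_types(child))
    return names

SOURCE = """
def copy(dst, src):
    n = len(src)
    if n > 0:
        dst[:n] = src
    return n
"""

trees = statement_trees(SOURCE)
kinds = [type(t).__name__ for t in trees]
# kinds == ['FunctionDef', 'Assign', 'If', 'Assign', 'Return']
```

Each element of `trees` would then be vectorized separately, and the resulting sequence of statement vectors passed to a recurrent model.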

Graph-Based Method. This method mainly mines security vulnerabilities from the source code or binary files of different programs [19]. Work [5] attempts to represent programs as graphs. It uses the syntactic and semantic information carried by the edges of the program dependence graph (PDG) and builds a vulnerability mining model with a graph neural network (GNN). A comparison with models that use less structured program representations shows the advantage of modeling known structure. Work [7] expresses program structure, syntax and semantics in the form of a graph and analyzes the corresponding structure of the program on this basis.
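To make the notion of a graph representation concrete, the sketch below extracts data-dependence edges, one ingredient of a program dependence graph, from straight-line code. It is a deliberately simplified illustration: it uses Python's `ast` module on Python source, handles only top-level statements, ignores control dependence and augmented assignments, and the function and variable names in the sample program are hypothetical.

```python
import ast

def data_dependence_edges(source):
    """Return edges (i, j): statement j reads a variable last written by statement i."""
    last_def = {}   # variable name -> index of the statement that last assigned it
    edges = set()
    for j, stmt in enumerate(ast.parse(source).body):
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Record uses first, so a statement such as `x = x + 1`
        # depends on the earlier definition of x, not on itself.
        for name in names:
            if isinstance(name.ctx, ast.Load) and name.id in last_def:
                edges.add((last_def[name.id], j))
        for name in names:
            if isinstance(name.ctx, ast.Store):
                last_def[name.id] = j
    return sorted(edges)

PROGRAM = """
size = read_len()
buf = alloc(16)
data = recv(size)
copy(buf, data, size)
"""

edges = data_dependence_edges(PROGRAM)
# edges == [(0, 2), (0, 3), (1, 3), (2, 3)]
```

The edge list shows that the final `copy` call depends on all three earlier statements, which is the kind of relationship a graph neural network can exploit and a flat token sequence cannot express directly.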

In work [29], improvements are mainly made to face image extraction. After preprocessing with HSV, hierarchical HMAX is used to extract features, yielding more features than previous work. In work [30], a new local feature descriptor is proposed based on the well-known Hessian-affine feature extraction algorithm. The method is adapted to automatic remote sensing imagery, resists local distortion and greatly improves robustness. Work [31] mainly studies feature extraction for quality classification. An SVM-RFE with filter, based on the Support Vector Machine (SVM), is constructed for the experiments and achieves good results.

Text-Based Method. This method extracts the main information from text content, vectorizes it and converts it into data that can be used directly [20]. Combining text mining with deep learning for vulnerability mining, and applying deep learning to program analysis, has achieved good detection performance. Work [5] proposes a coding standard for constructing program vector representations and builds a word frequency statistical model to describe the disclosure patterns of Java source files. In the program classification task, the representation is further fed into a deep neural network, which obtains higher accuracy than “shallow” models (such as logistic regression and SVM). However, only rough syntactic and semantic information is extracted from the program source code, which limits the performance of the vulnerability mining model.
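A word frequency representation of the kind described above can be sketched in a few lines of Python. The vocabulary, the sample fragment and the function name are assumptions made for illustration, and the sketch is not the statistical model of the cited work.

```python
import re
from collections import Counter

def term_frequency(text, vocabulary):
    """Relative frequency of each vocabulary word among the words of `text`."""
    words = re.findall(r"[A-Za-z_]\w*", text)
    counts = Counter(words)
    total = len(words) or 1          # avoid division by zero on empty text
    return [counts[word] / total for word in vocabulary]

# Hypothetical vocabulary of words considered indicative of risky code.
VOCABULARY = ["strcpy", "buf", "gets"]

vector = term_frequency("strcpy(buf, src); strcpy(out, buf);", VOCABULARY)
# The fragment has six words; strcpy and buf each occur twice, gets never.
```

The vector records how often each word occurs but not where, which illustrates why a text-based representation captures semantics only roughly.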

In work [26], we proposed using an attention mechanism to adaptively perceive context information and extract text features from readers’ emotional changes during reading, with a convolution threshold recurrent neural network used to predict readers’ emotions. Work [32] proposes a feature extraction method based on Bag-of-Matrix-Word (BOMW). It draws matrix words from a matrix dictionary and counts their frequencies to obtain the middle-layer features of the MFL data matrices. This improves both the effectiveness of the features and the recognition speed.

3.2 Mixed Characterization-Based Method

In recent years, data extraction has increasingly moved toward the mixed characterization-based model. This model combines at least two of the feature representation methods described above and achieves higher performance than the four single feature representation methods. Work [17] shows that deep learning models built with CNN or LSTM alone predict more accurately than the traditional method in the experiment, but still lag behind the mixed characterization-based model built with CNN-LSTM. Work [11] uses the intermediate representation of the Low Level Virtual Machine (LLVM) together with CNN-RNN to extract the key information of the source code, and the results are better than those of previous experiments. In work [7], the features of the candidate set samples are abstracted by combining CNN and NLP for vulnerability mining. How to integrate multiple features to achieve automated, fine-grained vulnerability mining is therefore a research topic worth exploring.
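The simplest way to combine two representations is to compute each view separately and concatenate the resulting feature vectors. The Python sketch below joins a lexical view (keyword counts) with a structural view (AST node type counts). It is an illustrative assumption, not the architecture of any surveyed work: the cited models learn their features with CNN and LSTM or RNN layers, whereas here both views are hand-counted, and all names are hypothetical.

```python
import ast
import re
from collections import Counter

def lexical_view(source, keywords):
    """Count how often each keyword appears as a word in the source text."""
    counts = Counter(re.findall(r"[A-Za-z_]\w*", source))
    return [counts[k] for k in keywords]

def structural_view(source, node_kinds):
    """Count how often each AST node type appears in the parsed source."""
    counts = Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))
    return [counts[k] for k in node_kinds]

def mixed_representation(source, keywords, node_kinds):
    """Concatenate the lexical and structural feature vectors."""
    return lexical_view(source, keywords) + structural_view(source, node_kinds)

SNIPPET = "n = len(src)\nif n > 0:\n    dst[:n] = src\n"
KEYWORDS = ["len", "src", "dst"]
NODE_KINDS = ["Assign", "If", "Call", "Subscript"]

features = mixed_representation(SNIPPET, KEYWORDS, NODE_KINDS)
# features == [1, 2, 1, 2, 1, 1, 1]
```

The first three entries come from the text of the snippet and the last four from its syntax tree, so a downstream classifier sees information that neither view provides alone.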

In work [25], a two-channel network prediction method is proposed. The processed data are fed into two parallel convolutional neural networks for feature extraction, and a combination of CNN and LSTM is used for prediction. The results are consistent with those of work [17]: the mixed characterization-based model performs better. Work [27] combines the approaches of work [25] and work [26], using the attention mechanism, CNN and LSTM to build a model for photovoltaic prediction.

3.3 Method Comparison

Table 1 compares the single feature extraction models with the mixed characterization-based method. The sequence characterization-based method mainly extracts features such as identifiers and operators. In practice, however, the large amount of code means that only part of it can be detected, so the extraction is incomplete. The abstract syntax tree-based method reads the source code, merges characters into tokens according to the lexical rules, removes whitespace and comments, then performs syntax analysis, converts the resulting token array into a tree and verifies the syntax. Node traversal then yields deeper information about the code. However, the process involves the whole code base, detection is slow, and some identifiers are deleted when the tree is generated, so the tree does not completely match the source code. The graph-based method extracts features by generating a graph structure. It focuses on the logical relationships between data and preserves the integrity of the source code to a certain extent, but detection is slow. The text-based method extracts keywords to stand in for the text, but it captures semantics only roughly and easily ignores contextual structure. In contrast, the mixed characterization-based method combines two or more of the above methods, which compensate for one another's shortcomings and make the extracted information richer and more complete. How to integrate multiple methods is a problem worthy of further research.

Table 1. Comparison of single feature extraction model and mixed characterization-based method.

4 Discussion

Traditional machine learning requires vulnerability features to be extracted manually and then converted into vectors as the input of a machine learning algorithm. It cannot automatically extract features from the original data and depends heavily on expert knowledge and manual work [21]. In combining deep learning with vulnerability mining in the future, we believe two aspects deserve deeper study.

The first is the efficiency and accuracy of vulnerability mining. Vulnerability mining depends on computation, which is closely related to software scale, hardware and analysis technology. Countermeasures should be adjusted to these factors to better meet requirements and improve efficiency. The second is the automation and intelligence of vulnerability mining. At present, many studies still rely on experts to solve problems. Automatic vulnerability mining is both a key point and a difficulty of current research, and it plays an important role in network attack and defense.

5 Conclusion

In this paper, we review some representative deep learning-based works on vulnerability mining. These methods can generally be divided into five categories: the sequence characterization-based method, the abstract syntax tree-based method, the graph-based method, the text-based method and the mixed characterization-based method. We also summarize their advantages and disadvantages from the perspectives of single and mixed feature extraction. Compared with traditional vulnerability mining approaches, deep learning-based methods can realize automatic vulnerability detection without requiring security experts to pre-define mining rules.

We therefore believe that vulnerability mining models should be constructed to suit the actual situation of each type of problem. The mixed characterization-based model can extract data information to the greatest extent. Incorporating deep learning algorithms into vulnerability mining is an inevitable trend. Automatic and intelligent vulnerability extraction is of far-reaching significance to learning and research, and increasingly accurate information extraction is its premise and foundation. Vulnerability mining based on deep learning is a topic worthy of in-depth discussion.