1 Introduction

With the popularity of mobile devices and computer networks, software systems play a critical role in all aspects of our society. Meanwhile, software vulnerabilities significantly impact businesses and people’s lives [1, 2]. A recent study pointed out that the Internet suffered nearly 800 million malware attacks in the second quarter of 2018, a record high [3]. Moreover, most of these attacks can be attributed to vulnerabilities in software. Additionally, the number of vulnerabilities reported publicly to the Common Vulnerabilities and Exposures (CVE) database has increased annually, with the number reported in 2021 hitting 20,000.

Identifying vulnerabilities before deploying software is an effective way to reduce potential losses caused by malicious attacks [4]. To identify vulnerabilities effectively, researchers have proposed many detection methods, which can be categorized into static, dynamic, and hybrid techniques [5]. Static techniques, such as rule/template-based analysis [6], static symbolic execution [7, 8], and code similarity detection [9,10,11], analyze given programs based on source code; their high false-positive rate is a significant limitation. Dynamic techniques analyze given programs by generating specific input data and often suffer from low code coverage [12]. Finally, hybrid techniques analyze a given program with a mixture of static and dynamic techniques [5]. However, they also suffer from the limitations of both approaches [13]. These methods improve the efficiency of vulnerability detection to a certain extent. However, given the significant growth of software code in size and complexity, and given their heavy reliance on manual analysis, these solutions fail to satisfy the increasing need for more efficient and effective detection [14].

To improve the efficiency and effectiveness of vulnerability detection, many pattern recognition and machine learning (ML) techniques have been used to build defect prediction models [15,16,17]. Building on pioneering studies, researchers have selected source-code-based features such as function calls [11], software complexity measurements, and code changes [18] as indicators for predicting vulnerable code fragments with ML approaches. However, ML-based techniques still require experts to define indicators explicitly [19, 20]. Furthermore, it is difficult for these indicators to reflect complex and variable vulnerability patterns or to discover new vulnerabilities.

The emerging deep learning (DL) approaches offer new potential for software vulnerability detection (SVD). On the one hand, DL approaches can extract high-level features automatically, relieving experts of tedious feature engineering tasks [21]. On the other hand, DL approaches usually have better generalization abilities and can improve detection performance [22, 23]. They can discover latent features that a human expert might never consider including and represent them in high-dimensional space [24, 25]. Therefore, DL has found applications in SVD, and DL-based SVD has become a promising field.

The goal of researchers is a vulnerability detection system that, like an experienced expert, can judge whether a piece of code is vulnerable, so that developers can be assisted in identifying and fixing vulnerabilities more efficiently. SVD methods based on DL are capable of reasoning about and understanding code semantics, which shows the possibility of achieving this goal. Researchers are presently pursuing the potential of the DL approach to increase the accuracy of SVD, as indicated by the growing number of scholarly articles (see Fig. 1). The success of DL for SVD calls for an inclusive review of the literature so that subsequent researchers can continue to contribute to this promising field.

Fig. 1 Recent growth in the number of DL-based SVD scientific publications

Table 1 Summary of related survey

How is this survey different from others? Although several related surveys on SVD have been published in recent years, few of them have deeply analyzed how researchers exploit the advantages of DL techniques in this field. Because the field of DL for SVD is advancing rapidly, existing reviews fail to cover the most recent studies that revealed new research directions. Table 1 shows the differences in scope, focused topics, and number of reviewed DL-based SVD works among those reviews. It can be seen from the table that the existing surveys rarely concentrate on DL-based SVD. They restrict themselves to conventional ML-based approaches [27, 5, 31] or traditional SVD techniques, including static and dynamic analysis [26, 28,29,30]. Therefore, there is a need to reveal the trend and progress of the application of DL for SVD. The survey conducted by Lin et al. [4] is the closest to our work, as they specifically reviewed DL publications in vulnerability detection and examined how DL techniques facilitate the understanding of code semantics. However, due to the rapid development of this research field, their work, which reviewed only 19 relevant papers, cannot cover many critical recent advancements. Given the rapid rise in popularity of DL-based SVD, a comprehensive review covering more papers would be of great value for researchers seeking deeper insight into these advancements. Hence, in our work, we review 48 studies that apply DL for vulnerability detection to provide a comprehensive picture of five years of advancements in this field of research. Finally, we provide a comprehensive discussion of future directions and challenges for this new area of research.

How did we select the papers? DataBase systems and Logic Programming (DBLP) and Google Scholar are the two primary databases containing papers in the computer field. We searched these two databases for relevant papers using several keywords, including vulnerability/fault detection, bug, source code, and DL. Furthermore, the paper selection focused on English publications from high-quality journals and conferences, which ensures that the selected papers offer promising innovations in SVD. We retrieved more than 90 DL-based SVD studies published in the last five years and finally selected the 48 with noteworthy contributions to the field.

Contributions of this survey. Our research intends to comprehensively assess the literature on DL-based SVD and demonstrate recent advances in the field. In addition, it can serve as a guide for researchers to grasp how DL techniques are applied to solve different aspects of the SVD problem and to understand the limitations and future directions of this area. We summarize the significant contributions as follows:

(1) We review 48 recent DL-based SVD studies, presenting a research trend in this active field.

(2) We identify a gap between human understanding and vulnerability detection systems, defined as the perception gap. From the perspective of the perception gap, we categorize how existing studies contribute to bridging this gap by combining the human experience with the advancement of DL.

(3) We compare the papers published in 2016–2017 with those published later and discuss current limitations, challenges, and opportunities of DL-based SVD, covering recent achievements. Moreover, we conducted several experiments based on real-world data sets to clarify our views.

The paper is organized as follows. In Sect. 2, we first analyze the shortcomings of current SVD methods, then review the early DL-based SVD methods and summarize their primary process. Section 3 reviews the remaining papers and categorizes them according to their research content, focusing on how the proposed methods and solutions bridge the gap between DL-based SVD methods and human understanding. Section 4 elaborates on lessons learned and future research directions in the field of DL-based SVD. Finally, Sect. 5 concludes the paper.

2 The gap between human understanding and vulnerability detection systems

2.1 The dilemma and potential of vulnerability detection systems

A vulnerability detection system is designed to find vulnerabilities hidden in software and to assist in completing code inspection [32]. Moreover, a data-driven vulnerability detection system needs to analyze code at the level of code semantics and syntax. However, the structure of vulnerable code can be very complex, as evidenced by CVE-2017-11176. The diff file of CVE-2017-11176 is shown in Fig. 2.

Fig. 2 The diff file of CVE-2017-11176

Fig. 3 Involved function of CVE-2017-11176

The vulnerability CVE-2017-11176 is located in the Linux kernel. There, message queuing allows asynchronous event notification: when a message is placed in an empty queue, the message queue can send a signal or start a thread. This asynchronous event notification relies on the mq_notify function, which creates or removes asynchronous notifications for the specified queue. Because mq_notify did not set the socket pointer to null when entering the retry process, a use-after-free (UAF) vulnerability may occur. More than 20 functions are related to the trigger process of the vulnerability, as shown in Fig. 3, and the code that needs to be modified to fix the vulnerability is not wholly consistent with the code that caused it.

It is difficult for a vulnerability detection system to understand the causes of such complex vulnerabilities, even with an expressive, specially crafted model and sufficient training data. A human expert, in turn, needs experience in code inspection, knowledge of the program, and an understanding of the programming language to identify such a vulnerability. Therefore, there is a gap between human understanding and vulnerability detection systems.

Many studies have mentioned this problem and pointed out that the key issue is that manual features cannot fully represent code semantics and syntax information [33,34,35]. Therefore, a previous review [4] presented the concept of the semantic gap, which is defined as: “The semantic gap is the lack of coincidence between the abstract semantic meanings of a vulnerability that a practitioner can understand and the obtained semantics that an ML algorithm can learn”. Researchers believe that DL-based vulnerability detection systems can narrow the semantic gap by learning intricate patterns and high-level representations that reveal the semantics of software code [36, 37].

The gap between human understanding and vulnerability detection systems is reflected both in the understanding of code semantics and in the cognition of the whole vulnerability detection task. People understand the task as finding vulnerabilities. However, detection systems can only find code similar to the vulnerability samples provided for training and cannot analyze whether this code is actually vulnerable.

In addition, some vulnerabilities may be triggered only under specific conditions, so the same piece of code may be vulnerable only when processing particular tasks. Experts can find such vulnerabilities by analyzing the differences between how the code executes and the actual requirements. However, current data-driven vulnerability detection methods lack this knowledge.

Therefore, the gap between human understanding and vulnerability detection systems cannot be covered entirely by the definition of the semantic gap. That definition covers only the gap between the detection system and human experts in the comprehension of code semantics, not the gap in the comprehension of code-relevant information. Section 2.2 will review preliminary research in DL-based SVD. On this foundation, we will further analyze the perception gap in Sect. 2.3.

2.2 Preliminary research in DL-based SVD

Because the application of the DL model does not rely on expert-defined features, it has been favored in various fields [38,39,40], such as speech recognition [41, 42], image recognition [43, 44], and machine translation [45, 46]. Furthermore, DL has become the basis of the most advanced artificial intelligence applications [47]. So far, Deep Neural Networks (DNN) have also been applied to SVD, and their detection performance is encouraging. In this section, we would like to discuss how the preliminary studies apply DL to vulnerability detection.

(1) Summary of recent works: To our knowledge, the first study that utilized DNN for SVD was proposed in [32]. Subsequently, some SVD methods based on DL were published within two years [22, 32, 34, 48,49,50,51], forming a relatively complete process.

Table 2 Reviewed studies published in 2016–2017

In the first study [32], the authors adopted a deep belief network (DBN) for detecting bugs and defects in Java source code. Since Abstract Syntax Trees (ASTs) provide a structured representation of the source code and preserve more syntactic information than the raw source text, the authors used ASTs to represent the semantic and syntactic information hidden in the source code. They kept only three types of AST nodes: function nodes, declaration nodes, and control-flow nodes. Other AST nodes were excluded to avoid diluting the importance of the selected nodes. Then, the authors adopted the method proposed in [53] to reduce the noise in the data set. To input ASTs into the DBN, the authors mapped AST nodes to tokens and used the method proposed in [54] to scale the input token values to the range between 0 and 1. The DBN is a generative graphical model, which the paper used to learn vulnerable code representations from a labeled data set. A DBN contains one input layer and several hidden layers, and the top layer is the output layer. Each layer consists of several stochastic nodes. The authors assumed that this multi-layer structure enables the DBN to reconstruct the semantics and content of the input data with high probability and to learn the representation of vulnerabilities. However, this method works only at the file level and cannot pinpoint the code lines related to a vulnerability.
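
The token mapping and scaling step described above can be illustrated with a minimal sketch. It is not the implementation of [32]: Python's built-in `ast` module stands in for a Java AST parser, the set of kept node types (`KEPT`) is a hypothetical choice mirroring the three node categories, and plain min-max scaling is an assumed stand-in for the normalization method of [54].

```python
import ast

# Hypothetical selection mirroring the three kept node categories:
# declarations (FunctionDef), invocations (Call), control flow (If/For/While).
KEPT = (ast.FunctionDef, ast.Call, ast.If, ast.For, ast.While)


def node_tokens(source):
    """Return the type names of the kept AST nodes of a source string."""
    tree = ast.parse(source)
    return [type(n).__name__ for n in ast.walk(tree) if isinstance(n, KEPT)]


def encode(tokens, vocab):
    """Map each token to an integer id, growing the vocabulary on demand."""
    return [vocab.setdefault(t, len(vocab) + 1) for t in tokens]


def scale(ids, vocab_size):
    """Min-max scale ids from 1..vocab_size into the range [0, 1]."""
    if vocab_size < 2:
        return [0.0 for _ in ids]
    return [(i - 1) / (vocab_size - 1) for i in ids]


source = "def f(x):\n    if x:\n        g(x)\n"
vocab = {}
tokens = node_tokens(source)
ids = encode(tokens, vocab)
scaled = scale(ids, len(vocab))
```

The resulting `scaled` list is the kind of fixed-range numeric vector that a network such as a DBN can take as input; real systems would also pad or truncate vectors to a common length.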

In [49], a Convolutional Neural Network (CNN) was adopted to generate semantic and structural features of the source code. The ASTs used were the same as those of [32], so each source file was represented by a token vector, from which the CNN extracted the semantic and structural features. These features were then combined with 20 traditional features extracted by Jureczko et al. [55]. Finally, the authors applied a Logistic Regression classifier [56] to judge whether an input file was buggy.

A function-level AST-based approach was proposed in [22]. Compared with file-level vulnerability detection, it can pinpoint the key code more precisely. The authors collected more than 6000 labeled functions from three open-source projects. For cross-project vulnerability discovery, a Bi-directional Long Short-Term Memory (BiLSTM) network was adopted to learn code representations. Since ASTs and source code lack control flow information and cannot reflect control dependencies, a method that applies CNNs over Control Flow Graphs (CFGs) for SVD was proposed in [34]. The authors collected four real-world data sets from the popular programming contest site CodeChef for their experiments.

The studies above used source code as training data, which requires expensive manual labeling. An approach for predicting memory corruption vulnerabilities was proposed in [52], in which the authors extracted features from both static and dynamic analysis. The authors assumed that the usage patterns of C library functions could reveal memory corruption vulnerabilities. They extracted call sequences/traces associated with the standard C library functions, both statically and by monitoring the programs’ execution for a limited period. Later, a study that also used function call sequences as features was proposed in [51]. It used the method proposed in [52] to obtain function call sequences. The difference was that the authors used zzuf, a popular multi-purpose fuzzer, to detect unexpected behavior and thereby obtain the label of every program. The disadvantage of this work was that the quality of the data set could not be guaranteed, and the labeling method is only suitable for some types of vulnerabilities.

(2) Discussion: Existing research results show that, compared with ML, even a simple DL model such as the Multilayer Perceptron (MLP) in [32] achieves better detection performance in various SVD tasks, which benefits from learning more complex and abstract high-level features or representations. Moreover, DL techniques allow detection systems to capture code semantics, understand contextual code dependencies, and automatically extract high-dimensional features that better reflect the essence of vulnerabilities. Although the constructed SVD systems cannot yet dig out the variety of vulnerabilities in source code that human experts can, DL-based SVD systems have performed better than all other types of data-driven vulnerability detection systems. At present, DL-based SVD technology is still in its infancy. With continuous improvement, DL technology is expected to realize automatic and intelligent vulnerability mining.

2.3 Vulnerability perception gap

Researchers have applied various SVD methods based on DL, and several recent studies have demonstrated that the accuracy of DL-based SVD can reach 90% in experimental scenarios. Nevertheless, in some actual detection environments, their performance dropped by more than 50% [57]. Current DL-based SVD methods are therefore still limited in their applications.

Hence, there is a gap between human understanding and vulnerability detection systems. We call it the perception gap and define it as follows:

The perception gap is a lack of consistency between practitioners’ cognition of object code and decision-making of the DL model.

This gap has two aspects. On the one hand, due to the black-box nature of many DL-based methods for SVD, people cannot comprehensively understand the decision process and reasoning of a DL-based detection system. On the other hand, DL systems still have shortcomings compared with human experts.

The lack of sufficient data is one of the most critical reasons [4, 58]. DL methods usually need large data sets, especially for complex tasks like vulnerability detection. However, the well-known public data set, SARD, was not collected for vulnerability detection. The lack of sufficient training data therefore prevents DL methods from extracting complete vulnerability features. Moreover, because DL methods do not understand the detection task as a human does, they need to be further optimized to achieve better results in various scenarios. In addition, the lack of fine-grained and interpretable DL-based SVD systems also contributes to this gap.

How can this gap be bridged? For an ordinary person to become an experienced expert, both an overall and a local understanding of the tasks of interest are essential, and the same is true for DL-based SVD methods. Therefore, researchers integrate their understanding of vulnerability detection tasks into the deep learning model by optimizing each step of DL-based SVD, so that the detection system can understand vulnerability detection tasks more like human experts do.

Fig. 4 Overview of DL-based SVD process

As shown in Fig. 4, DL-based SVD methods have formed a relatively complete process, including data collection, code representation, model building, and evaluation/test.

Data collection gathers labeled data sets for training neural networks. The critical point in this step is the granularity of high-quality labeled data. In general, fine-grained vulnerability labels can locate the vulnerable code more precisely. However, current vulnerability data sets are relatively small, so the available training data cannot meet the requirements.

Feature representation can be divided into code representation and word embedding methods. Code representation methods aim to express the semantic and grammatical information that the raw source code does not make explicit, and word embedding methods aim to turn code representations into vectors that neural networks can process. For most neural networks, word embedding methods such as Word2vec [59, 60] allow code representations to be fed into deep network training. However, due to the complex software background and expert experience related to vulnerabilities, current code representation methods cannot completely express this semantic and grammatical information.
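
To make the division between code representation and word embedding concrete, the following is a minimal sketch under stated assumptions. The regular-expression tokenizer and the helper names (`tokenize`, `build_vocab`, `embed`) are hypothetical, and the randomly initialized lookup table merely stands in for vectors that a trained model such as Word2vec would provide.

```python
import random
import re

# Hypothetical tokenizer: identifiers, integer literals, or single symbols.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split a code string into a flat token sequence."""
    return TOKEN.findall(code)


def build_vocab(token_lists):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def embed(tokens, vocab, table):
    """Look up one vector per token in the embedding table."""
    return [table[vocab[tok]] for tok in tokens]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
snippet = "strcpy(buf, src);"
tokens = tokenize(snippet)
vocab = build_vocab([tokens])
# Random 4-dimensional vectors; a trained embedding would replace this table.
table = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in vocab]
vectors = embed(tokens, vocab, table)
```

Here the token sequence plays the role of the code representation, and the table lookup plays the role of the word embedding; richer representations (ASTs, CFGs, slices) change the first step while the second stays structurally similar.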

Model building applies or customizes deep neural network models that automatically extract vulnerable patterns to detect potential vulnerabilities. The model structure determines the learning ability of the model for different types of data. Therefore, a model that has better learning ability and is easy to explain is desirable.

Evaluation/test trains and tests the built detector in specific application scenarios. The DL model needs to be optimized for different scenarios. Current research mainly focuses on feasibility and theory, and many problems remain to be solved in practical applications [57].

Therefore, DL-based SVD can better complete the vulnerability detection task by combining the experts’ understanding of the vulnerability detection task, which can bridge the perception gap.

3 DL-based SVD for bridging the perception gap

DL has been extensively used in natural language processing, such as machine translation and language understanding, and is also suitable for code semantic analysis, which can help human experts screen potentially vulnerable code. This has prompted many researchers to follow up on DL-based SVD methods to improve vulnerability detection ability.

3.1 Human experience facilitating DL-based SVD

Section 2.2 has introduced early DL-based research, mainly focusing on realizing a complete detection system. This section will focus on how researchers optimize the application of DL models in vulnerability detection.

The origin of the perception gap is that deep learning methods cannot obtain the relevant knowledge of software background and expert experience, because this knowledge cannot be extracted from the training data [61]. However, such knowledge is essential for vulnerability mining. Therefore, researchers analyze and extract the related knowledge and try to feed it into DL-based SVD models. Accordingly, this section divides the reviewed papers into four main optimization directions for DL-based SVD, which we introduce next.

Improvements in the quality of data sets: It mainly aims to optimize data acquisition, labeling, and processing methods, which objectively improve the quality of data sets and reduce the demand for computing power. The deep learning models are able to accurately extract vulnerability features with improved data quality and more accurate labeling.

More suitable feature representation methods: Feature representation can be divided into code representation and word embedding methods. In this research direction, researchers draw on their experience to select the code representation and word embedding methods best suited to detecting different types of vulnerabilities. Combined with the strong computing power of DL models, these code representation methods can further improve the feature extraction ability of DL methods. At the same time, a suitable word embedding method can better retain the semantic information of the code representation.

Neural networks with improved learning ability: On the one hand, researchers can improve the learning ability of neural networks by optimizing the structure of the networks or increasing their scale. On the other hand, the way features are fed into neural networks can also be optimized to retain more semantic and grammatical features.

Optimization for specific scenarios: This direction optimizes for specific problems encountered in the actual detection environment, for example when the source code of the software under test is not available or the computing power of the detection environment is insufficient.

The rationale behind the proposed categorization is as follows: Integrating more expert experience into each step of DL-based SVD methods can narrow the perception gap between DL-based SVD methods and human experts. The following sections will review some essential works to illustrate how researchers bridge the perception gap in different directions.

3.2 Improvements in the quality of data sets

This section shows essential works that optimize data sets’ quality from three aspects: data source, granularity, and label quality.

Table 3 Reviewed studies in Sect. 3.2

(1) Summary of recent works: In [62], a new data labeling method was proposed, and the authors used it to collect an extensive data set of 12 million source code samples. They used three open-source static analyzers: Clang, Cppcheck, and Flawfinder. However, the generated labels may not be consistent with the actual labels, so the method cannot guarantee the quality of the data.

A statement-level data generation method for detecting buffer overflow vulnerabilities was proposed in [63]. Compared with [48], it provided a synthetic code generator whose output can be compiled normally, and its data sets include control flow structures and code line numbers. The authors used the libclang interface to split the code files into statement-level code.

Then, a statement-level detector called VulDeeLocator was developed in [64]. The authors used intermediate code as the program’s representation to detect vulnerabilities at the slice level. The intermediate code was obtained with the Low Level Virtual Machine (LLVM). The extraction and segmentation of statement code required a tool, and the location of the vulnerable code was not precise enough. Subsequently, a line-level classification method called Vulcan was proposed in [37]. The authors investigated the problem of using ML to classify whether a line of a program contains a vulnerability.

Then, an extension of [22] was proposed in [65]. In this work, the authors used a real-world data set collected from GitHub and proposed a novel fuzzy-oversampling method to address the issue of insufficient non-vulnerable data. To provide sufficient real data, the authors of [67] provided more labeled code from nine different open-source software projects. The granularity of the data set covers the function level and the file level. At the function level, it contains 1,471 labeled vulnerable and 59,297 labeled non-vulnerable source code functions. At the file level, it contains 1,320 vulnerable and 4,460 non-vulnerable files. The experiments were conducted on the proposed real-world data set and the SARD data set with different network structures. In [67], a deep domain adaptation method was proposed to address the lack of labeled data. To overcome the lack of labeled vulnerability data, the authors adopted a semi-supervised variant that fully utilizes the information in the unlabeled target data by treating it as the unlabeled component in semi-supervised learning. Subsequently, they used spectral graphs [72] to represent the geometry of the data and optimized the output by minimizing the conditional entropy [73] of the source and target distributions.
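
As an illustration of the conditional entropy term mentioned above, the following sketch computes the average entropy of a model's predicted class distributions on a batch of samples. It is a generic formulation, not the exact objective of [67]; the function name and the example probabilities are hypothetical.

```python
import math


def conditional_entropy(prob_rows):
    """Average entropy (in nats) of per-sample predicted class distributions.

    Each row is assumed to be a probability distribution over classes
    for one sample, for example softmax outputs on unlabeled target data.
    """
    total = 0.0
    for probs in prob_rows:
        # Zero probabilities contribute nothing to the entropy.
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prob_rows)


# Hypothetical predictions for two samples and two classes
# (vulnerable / non-vulnerable).
uncertain = [[0.5, 0.5], [0.5, 0.5]]
confident = [[0.99, 0.01], [0.01, 0.99]]

h_uncertain = conditional_entropy(uncertain)
h_confident = conditional_entropy(confident)
```

Minimizing this quantity pushes predictions on unlabeled samples away from the decision boundary, which is why confident predictions yield a lower value than uncertain ones.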

Research [67] has shown that semi-supervised learning has the potential to effectively alleviate the lack of large-scale labeled data sets. Thus, it can provide more fine-grained labeled data and allow the DL model to locate vulnerable code more accurately. However, it is equally important to accurately distinguish the type of vulnerability, which helps explain why a code segment contains a vulnerability. A recent study [69] proposed a multi-class vulnerability detection method that addresses this problem. The source code was converted to token sequences, and the authors applied a Long Short-Term Memory (LSTM) network to classify vulnerabilities.

Later, a more efficient multi-class vulnerability detection method was proposed in [19]. First, the authors collected a data set containing 116 different types of vulnerabilities and 33,086 test cases. Then, code gadgets containing data dependencies, control dependencies, and the “global” semantics related to possible vulnerabilities were used. In addition, “code attention” was proposed to focus on “localized” information for detecting specific vulnerability types.

Previous studies have greatly refined the vulnerability granularity but have also made it more difficult to label source code. To solve the labeling problem for fine-grained data, a differential analysis-based approach called D2A was proposed in [70]. The authors built their data set by analyzing version pairs from multiple open-source projects. They selected bug-fix commits from each project and statically analyzed the versions before and after each commit. Detected issues that disappear in the corresponding after-commit version are likely to be real bugs. The authors used this method to generate a large labeled data set.
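
The core of the differential labeling idea can be sketched with a set difference. This is a simplified illustration, not the D2A pipeline itself: issues are modeled as hypothetical `(file, function, issue_type)` tuples reported by some static analyzer, and the sketch leaves open how issues that persist after the fix should be treated.

```python
def label_issues(before_issues, after_issues):
    """Split analyzer findings by whether they vanish after a bug-fix commit.

    Both arguments are sets of hashable issue descriptions reported by a
    static analyzer on the versions before and after the commit.
    """
    fixed = before_issues - after_issues       # disappeared: likely real bugs
    persisting = before_issues & after_issues  # still reported after the fix
    return {"likely_bug": fixed, "persisting": persisting}


# Hypothetical analyzer output for one version pair.
before = {
    ("queue.c", "notify", "use_after_free"),
    ("queue.c", "send", "null_deref"),
}
after = {
    ("queue.c", "send", "null_deref"),
}

labels = label_issues(before, after)
```

In practice, matching issues across versions is harder than exact tuple equality because line numbers and code shift between commits, so a real system needs a more tolerant notion of issue identity.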

In addition, a data set collection method was proposed in [57]. To obtain a labeled data set, the authors collected already fixed issues with publicly available patches from open-source software, such as Linux Debian Kernel and Chromium. Later, Wang et al. proposed an automatic data labeling method in [71] that obtains data from GitHub automatically, further reducing the labor cost of data set collection. They built an automatic framework for collecting vulnerable code samples. In this framework, a set of predictive models, or experts, was used to predict whether a code commit is relevant to a code vulnerability. Moreover, the vulnerable code segment can be identified by comparing the versions of the code before and after the commit.
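
Comparing the code before and after a commit can be sketched with Python's standard `difflib` module. This is only an illustration of the comparison step, with a hypothetical helper name and a made-up code pair; it is not the framework of [71], and lines removed by a fix are only candidates for the vulnerable segment.

```python
import difflib


def removed_lines(before, after):
    """Return the lines present before a commit but absent afterwards."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # In ndiff output, lines unique to the first input start with "- ".
    return [line[2:] for line in diff if line.startswith("- ")]


# Hypothetical pre-fix and post-fix versions of a code fragment.
before = "buf = alloc(n)\ncopy(buf, src, m)\n"
after = "buf = alloc(n)\nif m <= n:\n    copy(buf, src, m)\n"

candidates = removed_lines(before, after)
```

As the CVE-2017-11176 example in Sect. 2.1 shows, the code changed by a fix does not always coincide with the code that caused the vulnerability, so such candidates still need validation.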

(2) Discussion: There are three main improvements in the quality of data sets. The first is to find a better data source. Most of the data sets of the papers reviewed in Sect. 2.2 are not from actual application software or do not use source code, for example [48, 52, 51]. Compared with actual software code, synthetic or semi-synthetic code has a simple structure, and some synthetic code cannot even be compiled normally. The vulnerability characteristics extracted from such code cannot meet actual vulnerability detection requirements. In addition, the granularity of these data is at the program level or file level, which cannot meet the need to locate vulnerabilities accurately.

The second is to improve the granularity of data. Studies in this section significantly refine the granularity of vulnerability labels from the file level to the function level or statement level. At the same time, specific vulnerability types have been labeled in some reviewed papers, such as [69, 19]. These changes have greatly improved data quality.

The last is to improve the labeling and cleaning methods for data. Compared to previous studies, more than half of the studies extracted some or all of their data sets from open-source software projects. They optimized the quality of data labels by manual effort [64,65,66,67] or by analyzing the differences between versions of the software code [70, 57, 71].

3.3 More suitable feature representation methods

This section shows some critical works that optimize feature representation methods from two aspects: code representation and word embedding methods.

Table 4 Reviewed studies in Sect. 3.3

(1) Summary of recent works: In order to fully extract the semantics of programs, Fan et al. combined static measurement methods with ASTs in [61]. They believed that ASTs could reflect the original structure of source code and preserve more semantic information. The authors combined these semantic features with traditional static metrics to improve the performance of SVD. With similar intent, another approach was proposed in [74]. The authors considered two types of complementary features for vulnerability detection. The first was CFGs generated by Clang and LLVM [84], and the second was based on the source code directly. The authors converted both sets of features to token sequences and then converted the two types of token sequences into vector representations using a bag-of-words model and the word2vec [59, 60] model.

Paper [75] proposed a new code representation method that embeds code comments. The author believed that code comments could reflect the semantics and functions of source code, so semantic features could be extracted from them. Because some real-world code lacks comments, the author’s testing modules did not contain comments, so that the classifier could cope with missing comments. Therefore, comments were only fed into the model during the training process.

An approach in [24] proposed a program representation called “code gadget” for detecting vulnerabilities. A code gadget is a number of lines of code that are semantically related in terms of data dependency or control dependency. Therefore, code gadgets could be used to capture vulnerabilities related to data flow or control flow dependencies. In addition, the authors used a commercial tool called Checkmarx (footnote 7) to generate code gadgets.
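The idea of grouping semantically related lines can be sketched as a backward slice over data dependencies. The sketch below is a simplification and not the procedure of [24]: it assumes that the variables defined and used by each statement have already been extracted (here they are written by hand), and it ignores control dependencies. The statement table and the name `backward_slice` are hypothetical.

```python
# Each statement: (line number, variables defined, variables used).
# This hand-written table stands in for the output of a real program analyzer.
STATEMENTS = [
    (1, {"n"}, set()),           # n = read_len();
    (2, {"buf"}, set()),         # char buf[16];
    (3, {"msg"}, set()),         # msg = "hello";
    (4, {"m"}, {"n"}),           # m = n * 2;
    (5, set(), {"buf", "m"}),    # memcpy(buf, src, m);
]

def backward_slice(statements, target_line):
    """Collect the lines that the target statement depends on through data flow."""
    by_line = {line: (defs, uses) for line, defs, uses in statements}
    needed = set(by_line[target_line][1])   # variables whose origin we still need
    gadget = {target_line}
    for line, defs, uses in sorted(statements, key=lambda s: s[0], reverse=True):
        if line >= target_line:
            continue
        if defs & needed:
            gadget.add(line)
            # The definition is resolved; now its own inputs must be traced.
            needed = (needed - defs) | uses
    return sorted(gadget)

print(backward_slice(STATEMENTS, 5))  # prints [1, 2, 4, 5]
```

Line 3 is left out because nothing that reaches the `memcpy` call depends on `msg`, which is the kind of irrelevant statement a gadget is meant to exclude.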

In a recent study [76], Li et al. further extended the “code gadget” adopted in [24] so that it contains both the data dependency and the control dependency of code sequences. Furthermore, they implemented the study on an extended version of the open-source parser Joern [13]. Compared with Checkmarx used in [24], it could accommodate new semantic information of programs.

Then, a name-based bug detection approach for JavaScript was proposed in [77]. It detects accidentally swapped function arguments, incorrect binary operators, and incorrect operands in binary operations. The authors argued that the names of variables and functions contain helpful information for these three types of bugs. Similarly, a name-based vulnerability detector was used for C/C++ and Python programs in [78]. The authors argued that function names contain important semantic features for distinguishing vulnerable functions in source code. However, function names usually cannot provide enough information, and the detector cannot accurately locate the vulnerability.

The reviewed studies have shown that ASTs and CFGs effectively represent the semantic and syntactic information hidden in source code. Later, a representation method that merged ASTs, CFGs, Data Flow Graphs (DFGs), and Program Dependence Graphs (PDGs) was proposed in [35]. A PDG uses graph notation to make data and control dependencies explicit. These dependencies are considered during the dependence analysis phase of compiler optimization, which improves parallelism and the use of multiple cores. The authors called the merged representation a code property graph (CPG) and stored it in a joint data structure [13]. Then, an Intermediate Representation (IR) based method was proposed in [79]. The authors argued that, as an intermediate code representation containing data and control information, IR could be used to extract vulnerability characteristics across different programming languages.
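A joint structure of this kind can be pictured as a single graph whose edges carry labels naming the representation they come from. The Python sketch below is only a toy illustration under that assumption; it does not reproduce the storage format of [13] or [35], and the node names and the helper `merge_graphs` are hypothetical.

```python
from collections import defaultdict

def merge_graphs(**edge_sets):
    """Merge per-representation edge lists into one graph with labeled edges."""
    joint = defaultdict(set)
    for label, edges in edge_sets.items():
        for src, dst in edges:
            joint[(src, dst)].add(label)
    return dict(joint)

# Hand-written edges for: void f() { char buf[16]; strcpy(buf, src); }
ast_edges = [("func", "decl_buf"), ("func", "call_strcpy")]   # syntactic nesting
cfg_edges = [("decl_buf", "call_strcpy")]                     # execution order
pdg_edges = [("decl_buf", "call_strcpy")]                     # the call uses buf

cpg = merge_graphs(AST=ast_edges, CFG=cfg_edges, PDG=pdg_edges)
for edge, labels in sorted(cpg.items()):
    print(edge, sorted(labels))
```

Keeping the labels on shared edges lets one query combine syntactic, control flow, and dependence conditions over the same pair of nodes.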

In addition, an automated and intelligent vulnerability detection method was proposed in [80], which used a minimum token sequence representation of the code. This representation ensures that more information is input into the neural network and improves the ability to detect vulnerabilities in long code.

Although many code representation methods have been adopted in vulnerability detection, few studies pay attention to how the representations themselves perform. An evaluation of vulnerability detection performance across code representations was later proposed in [81]. To evaluate the different code representations, the authors proposed a DL framework consisting of three DNNs in conjunction with five representations: ASTs, Code Gadgets (CGs), Semantics-based Vulnerability Candidates (SeVCs), Lexed Code Representations (LCRs), and Composite Code Representations (CCRs). Their experiments concluded that the CCRs had the best overall improvement.

Different from the previously reviewed papers, a Kernel Based Extreme Learning Machine (KELM) model that focuses on the optimization of vector representation was proposed in [82]. This paper adopted a multi-level word embedding method to represent the features of code structure better. Specifically, the authors obtained the symbolic representation of the source code related to the vulnerability through three kinds of symbolization and introduced Doc2vec [85] for vector representation. Thus, it can significantly reduce the noise introduced by irrelevant information about vulnerable codes.

In addition, Hin et al. proposed a new deep learning system, LineVD [86], which treats statement-level vulnerability detection as a node classification problem. LineVD used a transformer-based approach, CodeBERT [83], to encode the raw source code tokens and a Graph Neural Network (GNN) to exploit control and data dependencies between statements.

(2) Discussion: This section divides the improvements in feature representation methods into two parts. The first improves the form of code representation. Source code, or tokens extracted from it, can be used directly as a code representation. However, for various reasons, such as software updates, source code contains much redundant information, and the order in which code is written may differ from the order in which it is executed. These factors may affect the learning efficiency and effectiveness of deep learning models. Therefore, researchers applied ASTs, obtained from the compilation process of source code, as code representations. ASTs remove redundant fragments from the source code and restore its execution order. Building on ASTs, researchers added control flow and data flow information, forming various representation methods such as code gadgets [24, 76] and IRs [79]. In addition, studies [35, 61, 74] combined several representations to input more features into neural networks. These studies indicate that richer representations help to improve the performance of neural networks.

The second enhances the embedding methods, which can be divided into non-contextual and contextual embedding techniques. Non-contextual embedding techniques, such as Word2vec [59, 60] and GloVe [87], convert each word in the text into a separate high-dimensional vector representation. Contextual embedding techniques, such as CodeBERT [83], take context information into account and can recognize polysemy and similar terms according to the context. Research [86] compares the effects of different word embedding methods, such as Doc2vec [82], GloVe, and CodeBERT, on vulnerability detection results in the same scenario. The results show that contextual embedding techniques such as CodeBERT can effectively improve vulnerability detection results.

3.4 Neural network with improved learning ability

This section shows some works that focus on optimizing neural networks for SVD.

Table 5 Reviewed studies in Sect. 3.4

(1) Summary of recent works: In order to extract more comprehensive features from source code, a method was proposed in [68]. In this study, a global max-pooling layer was used to capture the most critical features of vulnerabilities. The authors used the AST of each function as the original feature, believing that it could better distinguish the semantic information of different sequences in high-dimensional space. However, some vulnerable and non-vulnerable code is hardly distinguishable, resulting in low detection accuracy. Therefore, an attention mechanism was adopted in paper [88] to capture the critical features of vulnerabilities, and the Code Property Graph (CPG) was used to obtain semantic features in this framework. The CPG is a data structure created to search large codebases for instances of programming patterns. These patterns are expressed in a domain-specific language. The CPG acts as a unified intermediate program representation for all of Joern’s supported languages. Then, a similar study that uses an attention mechanism was proposed in [89].

After that, DP-Transformer, with improved learning ability, was applied to software defect prediction in [90]. The transformer network consists of stacked self-attention and position-wise fully connected layers for both the encoder and the decoder. In the full architecture, after each input sequence is fed to the encoder, the decoder generates an output sequence of symbols one element at a time. Notably, in this study only the encoder part of the transformer was used to extract features from the source code.
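The self-attention operation at the core of such encoders can be written in a few lines. The following plain-Python sketch shows scaled dot-product attention for a single head, without the learned projection matrices, multi-head structure, or position-wise layers of a full transformer; it is not the DP-Transformer of [90], and the input vectors are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    output = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output

# Three token vectors attending to each other (Q = K = V in self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(tokens, tokens, tokens):
    print([round(x, 3) for x in row])
```

Because every token is compared with every other token, the cost grows quadratically with sequence length, which is one reason long functions are costly to process.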

After that, the BERT model was also applied to SVD in paper [91]. The authors adopted a neural network consisting of 12 transformer blocks and a softmax layer. Training includes two stages: pre-training on the English Wikipedia data set [94] and fine-tuning on vulnerability detection tasks. The SARD data set was used in the fine-tuning stage.

The reviewed studies showed that CNNs and Recurrent Neural Networks (RNNs) were able to learn high-level representations for software defect prediction. Moreover, they showed that graph-based code representations, such as ASTs and CFGs, could represent semantic and syntactic information. However, the studies mentioned above convert the code representations to sequences before feeding them to the deep network instead of processing them in their original tree or graph form. The study proposed in [14] changed this situation. The authors combined the AST, CFG, DFG, and code sequence into a joint graph to comprehensively represent semantic and syntactic information. Then, gated graph recurrent layers [95] were adopted to learn from the input graph structure. The main idea of this method was to combine multiple code representation methods to obtain denser local features. Then, another GNN-based method was proposed in [33]. The authors believed that structured information could better retain vulnerability features, and they applied a graph-based neural network to capture the graph representations of code explicitly. In this paper, ASTs and control-data flow graphs (CDFGs) were used as code representations. Later, Cao et al. [93] proposed a Bidirectional Graph Neural Network (BGNN) to improve the performance of DL-based vulnerability detection approaches.

(2) Discussion: In this section, some researchers strengthen CNN or other networks by introducing new mechanisms. For example, Lin et al. [68] adopted a global max-pooling layer to capture the most important signals; Duan et al. [88] combined an attention mechanism with CNN to capture the critical features.

Some researchers have used advanced neural networks for vulnerability detection, such as Transformer [90] and BERT [91]. It is worth noting that more and more researchers have applied GNNs to SVD, such as [14, 33, 93]. Compared with other neural networks, GNNs can retain more structural features [4].

3.5 Optimization for specific scenarios

This section shows some essential works that focus on optimizing specific scenarios in DL-based SVD.

Table 6 Reviewed studies in Sect. 3.5

(1) Summary of recent works: The cold-start problem is common in DL-based methods: ML tools usually cannot perform well due to the lack of high-quality data sets. Liu et al. proposed a method to overcome the problem of insufficient data sets in [96]. The authors aimed to learn the vulnerability features common to data sets of the same type in order to perform better on an unlabeled test set, and they adopted a metric transfer learning framework (MTLF). In MTLF, the target domain’s Mahalanobis distance metric is computed by maximizing within-class covariance and minimizing between-class covariance. This could avoid the influence caused by the distribution difference between the target domain and the source domain.
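For reference, a Mahalanobis distance parameterized by a positive semi-definite matrix M is d_M(x, y) = sqrt((x - y)^T M (x - y)); metric learning chooses M, and the identity matrix recovers the Euclidean distance. The Python sketch below only evaluates this formula for a hand-picked M. It does not learn M and is not the MTLF procedure used in [96]; the vectors and matrices are assumptions chosen for illustration.

```python
import math

def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for a positive semi-definite matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    # Matrix-vector product M @ diff.
    m_diff = [sum(M[i][j] * diff[j] for j in range(len(diff))) for i in range(len(diff))]
    return math.sqrt(sum(d * md for d, md in zip(diff, m_diff)))

identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[4.0, 0.0], [0.0, 1.0]]   # hand-picked: weights the first feature more

x, y = [0.0, 0.0], [3.0, 4.0]
print(mahalanobis(x, y, identity))   # prints 5.0, the Euclidean distance
print(mahalanobis(x, y, stretched))  # sqrt(4 * 9 + 16) = sqrt(52)
```

Changing M changes which feature differences count as large, which is how a learned metric can pull samples of one class together across domains.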

A different method for improving the efficiency of cross-domain detection was proposed in [97]. First, the authors attempted to bridge the distribution divergence between source and target projects by combining adversarial learning with discriminative feature learning, extracting transferable semantic features from source code. To achieve this goal, they trained an Adversarial Discriminative Convolutional Neural Network (ADCNN) model in two independent training stages. In the first stage, the labeled source data was used to train the source encoder and source classifier. In the second stage, the authors trained the target encoder to make the target data representation similar to the source data representation by fooling the discriminator. Finally, the authors fed the generated features into a Logistic Regression (LR) classifier. The experimental results demonstrated that the proposed method performs better than other related cross-project defect prediction methods. Later, an extension of [97] was proposed in [98]. The authors pointed out that the method proposed in [97] has a negative impact on predictive performance due to the mode collapse problem of GANs. To tackle this problem, the authors adopted a Dual Generator-Discriminator Deep Code Domain Adaptation Network (Dual-GD-DDAN).

It is also very important to integrate the detection system into an Integrated Development Environment (IDE), so that vulnerabilities can be corrected as early as possible. A tool integrated with an IDE as a plugin was developed in [99]. This tool worked in the background and could label vulnerable code in the IDE. In this tool, ASTs created from the source code were used as the deep representation, and a three-layer neural network was used as the classifier. Therefore, this tool could detect code vulnerabilities in real time during software development. The experiments were conducted on both open-source code and Cisco codebases for the C and C++ programming languages. The results showed that the method was a promising approach for predicting vulnerabilities.

An interpretable model is also urgently needed. DL models are considered black boxes because of the difficulty in explaining the relationship between input and output, which hinders understanding and confirming the output results. A method to determine the influence of local input on output was proposed in [100]. The authors combined two techniques to realize this method: 1) Syntax-Directed Attention; 2) Code Perturbation. Specifically, they determined the impact of perturbed code on the output by observing the change in the attention score. The experiments on more than 1000 programs indicated that attention scores could explain the output of DL-based SVD models.
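The perturbation idea can be sketched independently of any particular model: remove one line at a time and record how much a score changes. The Python sketch below is a simplified stand-in, not the method of [100]; in particular it perturbs by deletion and watches a generic output score rather than attention scores, and `toy_score` is a hypothetical placeholder for a trained detector.

```python
def line_influence(lines, score):
    """Map each 1-based line number to the score drop caused by deleting that line."""
    base = score(lines)
    influence = {}
    for i in range(len(lines)):
        perturbed = lines[:i] + lines[i + 1:]
        influence[i + 1] = base - score(perturbed)
    return influence

def toy_score(lines):
    # Hypothetical stand-in for a model: flags any code that calls strcpy.
    return 1.0 if any("strcpy" in line for line in lines) else 0.0

code = ["char buf[16];", "strcpy(buf, src);", "return 0;"]
influence = line_influence(code, toy_score)
print(influence)  # prints {1: 0.0, 2: 1.0, 3: 0.0}
```

With a real detector the drops are rarely this clean, but the ranking of lines by influence still indicates where an analyst should look first.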

(2) Discussion: This section mainly introduced works that solve specific problems in DL-based SVD, such as cold start [36], cross-project detection [96], and the interpretability of neural networks [100]. Among them, cold start and cross-project vulnerability detection are very real problems, because in actual detection the software to be tested lacks labeled vulnerability data, which seriously limits the effectiveness of software vulnerability detection. The interpretability of DL models affects the confirmation and repair of software vulnerabilities. Current research methods can alleviate these problems to a certain extent, but they still face many limitations in the virtual environment. Research on DL-based vulnerability detection is still in its infancy. We hope that more research can explore and solve these practical problems.

4 Challenges and future directions

This section conducts experiments on real-world data sets and draws conclusive remarks on research challenges and future trends based on previous works. The computational system used was a server running Ubuntu 22.04 LTS with two physical Intel(R) Xeon(R) E5-2683 v4 2.00GHz CPUs, 32GB RAM, and NVIDIA RTX 3090 GPUs. The main models involved in the experiments were from GitHub (footnote 8).

4.1 The lack of large-scale real-world benchmark data sets

The DL-based SVD methods need training on large-scale real-world data sets to achieve optimal performance [4]. At present, the lack of large-scale data sets containing high-quality vulnerability labels limits the research progress in this field. Table 7 summarizes the characteristics of a few popular software vulnerability data sets used by reviewed works.

Table 7 Software vulnerability data sets collected by the reviewed studies at the time of writing

It can be seen from Table 7 that more than half of the articles use synthetic or semi-synthetic data sets. The SARD (footnote 9) data set is the most widely used because it is open, easy to obtain, and relatively large. However, this data set was initially designed for evaluating traditional vulnerability prediction based on static and dynamic analysis [102]. Therefore, its source code is simplified and isolated. Research [57] compared the SARD data set with a real-world data set the authors collected and found that the two differ significantly in code complexity measurements. The main drawback is that the code patterns lack diversity compared to code from real-world programs.

In addition, other synthetic or semi-synthetic data sets, such as CJOC-bAbI [48] and s-bAbI [63], have similar problems. Besides, the scale of these two data sets is relatively small, and the source code of CJOC-bAbI cannot even be compiled. Furthermore, the PROMISE (footnote 10) data set can no longer be found on the website provided in the study [32] because it has not been maintained for a long time. We extracted the data set from a website mirror on GitHub (footnote 11). The obtained data only provides static features extracted from the software source code, which is not conducive to further research based on this data set.

Synthetic or semi-synthetic data sets do not fully capture the complexities of real-world vulnerabilities [102,103,104]. Many existing works created self-constructed data sets based on different criteria. However, only a few fully released their data sets. Lin et al. released the data set used in their study [66]. Because its vulnerable source code was extracted manually based on CVE (footnote 12) information, the real-world data set provided by Lin et al. has been used in many studies, such as [67, 98]. In addition, its high-quality labels give this data set the potential to become a small benchmark data set at function-level.

Table 8 The number of vulnerable and non-vulnerable functions on three data sets
Table 9 Cross-domain test results of real-world data set Lin and semi-synthetic data set SARD
Table 10 Test results of real-world data set Lin in function-level and file-level

However, due to labor costs, the scale of Lin is small. Table 9 lists the test results of the detection ability based on the real-world data set Lin and the semi-synthetic data set SARD. The details of the data sets are shown in Table 8; there is no duplication between the training set and the test set (the bold value indicates the best performance of the different methods on the same performance metric). The experimental results show that detection based on real-world data sets is significantly better than detection based on semi-synthetic data sets. However, it still suffers from high false positive rates due to the insufficient data scale. For more general cases, large-scale real-world benchmark data sets are still needed. Such data sets could facilitate all research in this field, and comparative experiments carried out on them could fairly reveal the differences between works. Of course, the granularity and label quality of the data sets are as important as the data scale. It can be seen from Table 10 that the deep learning models trained on the file-level data set performed better than the models trained on the function-level data set.
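For readers less familiar with the metrics behind such comparisons, the Python sketch below computes the false positive rate, precision, recall, and F1 score from confusion-matrix counts using their standard definitions. The counts are invented for illustration and are not taken from any table in this review.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0   # share of clean code flagged as vulnerable
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1}

# Made-up counts: 50 vulnerable and 100 non-vulnerable functions in a test set.
metrics = detection_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because vulnerable samples are usually a small minority, a low false positive rate can still translate into many false alarms in absolute terms, so precision should be read alongside it.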

Other real-world data sets, such as Draper [102] and D2A [70], have a large scale, but the quality of their labels needs to be verified. For example, the labels of the Draper data set come mainly from static detection tools, and the labels of the D2A data set come from the analysis of software source code in different versions. Although these labeling methods are reasonable, there is still a big gap between them and labeling based on CVE information.

Indeed, there may be potential risks in releasing open source vulnerability data sets, but a large-scale vulnerability data set with high-quality labels is essential. For data sets that are not suitable for release, Federated Learning (FL) may alleviate this problem because it can share features without sharing the data sets themselves [105].

4.2 Effective code representations

In order to optimize the performance of DL-based SVD, researchers have proposed a variety of code representation methods to provide neural networks with richer semantic and syntactic features. Furthermore, it has been shown that code representation methods can preserve more useful structural information of source code, resulting in the best balance between precision and recall [102, 106].

Researchers currently attempt to integrate various code representation methods so that they contain more information and extract vulnerability features better. For example, some studies extract features from information related to source code, such as code comments [75] and binary files [52]. These methods play a positive role in improving detection performance.

Table 11 Test results of real-world data set Lin with different code representations
Table 12 The number of vulnerable and non-vulnerable functions on different code representations

In addition, many researchers use code structure analysis to optimize code representation methods, such as the AST [68]. An AST is a tree representation of the abstract syntactic structure of source code written in a formal language; each node of the tree denotes a construct that appears in the text. Moreover, researchers add the source code’s data flow, control flow, and other structural information to the AST, forming various code representation methods such as CPG [88], CDFG [33], and code gadgets [92]. A program’s control flow refers to all possible routes through the program during execution, and the data flow tracks how variables are used along the control flow.
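The notion of an AST can be made concrete with the `ast` module from the Python standard library. The reviewed studies target mostly C/C++ code and use parsers such as Joern, so this is only an analogy in a different language; the helper `node_type_counts` and the sample statement are hypothetical.

```python
import ast
from collections import Counter

def node_type_counts(source):
    """Parse Python source and count how often each AST node type occurs."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

source = "total = price * qty + tax"
counts = node_type_counts(source)
print(counts["Assign"], counts["BinOp"], counts["Name"])  # prints 1 2 4

# The nested structure: the multiplication is the left operand of the addition.
print(ast.dump(ast.parse(source).body[0].value))
```

Whitespace and comments do not appear in the tree, and operator precedence is already resolved in the nesting, which illustrates why ASTs are regarded as a cleaner input than raw token streams.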

Table 12 lists the number of different code representations generated by Joern from the same data set. The number of generated code representations is not the same as the number of source code samples because the code representations of some samples are empty. Table 11 shows that the detection method trained on source code performs even better than those trained on single code representation methods. Although the combined code representations perform better than single methods, the size of the code representation that can be input to neural networks is limited by hardware. Thus, while adding these pieces of information to the AST, researchers also need to continually remove low-value features to form a dense feature representation.

How to extract the most important features from combined representations and input them fully into neural networks is a research question worthy of attention. However, because vulnerability patterns are diverse, detecting different vulnerabilities requires retaining different features. Therefore, customizing code representation methods for specific types of vulnerabilities may improve the detection performance for those vulnerabilities. For example, there are currently code representation methods suitable for buffer overflow vulnerability detection [63], but for most other types of vulnerabilities there are no specialized code representation methods. Therefore, designing the optimal code representation methods for specific programming languages and vulnerability types may be an important research topic in the future.

4.3 Humanoid DL model

Researchers’ goal is for the neural model to detect vulnerable code as human experts do and to pinpoint the relevant code leading to the vulnerability. To achieve this goal, more complex models with stronger learning abilities have been applied to this field, from simple models such as DBN [32] and MLP [52] to relatively complex ones such as CNN [49] and LSTM [22]. It can be seen from Tables 13 and 14 that even the DNN model performs better than the traditional detection tools Flawfinder and Cppcheck. However, DL-based SVD methods have poor stability in cross-domain detection due to differences in data distribution.

Table 13 Test results of methods, trained and tested based on real-world data set Lin
Table 14 Cross-domain test results of methods, trained based on real-world data set Lin, tested based on real-world data set REVEAL

Besides, these models cannot focus on and fully learn important features or pinpoint the critical code lines that affect the output. Therefore, the attention mechanism, which has this ability, has been sought after by researchers. For example, research [88] and [89] applied the attention mechanism to CNN and RNN networks and showed that it is indeed helpful in improving the performance of DL-based SVD. In addition, studies [37] and [100] used the attention mechanism to calculate the weights of different code lines, which showed that it helps pinpoint the critical code lines that lead to vulnerabilities.

In addition, Transformer [90] and BERT [91], which are built entirely on attention mechanisms, have also been applied in this field. As a result, these models have stronger learning abilities than the previous models. In research [91], the authors pre-trained the BERT model on an English text data set and then used this model to detect vulnerabilities.

We believe that, with the continuous development of neural networks, humanoid DL models may appear that can help human experts complete most of the tedious vulnerability detection work. At present, many advanced deep learning approaches such as meta-learning [107] have been proposed, and some models are especially suitable for software code analysis, such as CodeBERT [83]. Due to the limitations of our hardware platform, we did not train the CodeBERT model locally but fine-tuned the pre-trained model released by [83]. We hope that future research can fully tap the potential of this model in the field of SVD. As these models are applied to SVD, the detection accuracy of SVD methods can be further improved.

4.4 Semantic retention in neural networks

Notably, the neural models may not be able to capture the code semantics as human experts do.

On the one hand, the loss of semantic information is inevitable in the process of model training, but each neural network model retains semantics differently. For example, previous studies have shown that CNNs have advantages in local feature extraction [49, 80], and BiLSTM networks are better at extracting features from long sequences [24, 69]. However, in the field of vulnerability detection, it is not yet clearly understood which semantics different neural networks tend to retain [76]. How to select a suitable neural network to preserve the semantic features relevant to vulnerabilities needs further research.

On the other hand, more semantic loss occurs before the code representations are input to neural networks. Due to hardware limitations, code representations are usually limited to a fixed length at input, which means that representations exceeding this length are truncated [68]. Real-world data sets contain a lot of lengthy code, so using them for training or detection causes severe semantic loss. However, this review has not found research dealing with this problem.
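The effect of a fixed input length is easy to demonstrate. The Python sketch below pads short token sequences and truncates long ones, returning the dropped tokens so that the loss is visible; the function name `fit_length`, the padding token, and the example tokens are assumptions made for illustration.

```python
def fit_length(tokens, max_len, pad="<pad>"):
    """Pad or truncate a token list to max_len; also return what was cut off."""
    kept = tokens[:max_len]
    dropped = tokens[max_len:]
    padded = kept + [pad] * (max_len - len(kept))
    return padded, dropped

# Hypothetical token sequence in which the risky call comes late in the function.
tokens = ["int", "f", "(", ")", "{", "init", "(", ")", ";",
          "strcpy", "(", "buf", ",", "src", ")", ";", "}"]

model_input, lost = fit_length(tokens, max_len=9)
print(model_input)
print("strcpy" in model_input, "strcpy" in lost)  # prints False True
```

In this example the only suspicious call never reaches the model, so no amount of training can recover it from the truncated input.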

Table 15 Test results of real-world data set REVEAL in slice-level

At the same time, many code representations, such as ASTs and CFGs, are graph structures. However, for most neural network models, such as RNNs, the graph structure needs to be transformed into a sequence before it can be input, and structural features may be lost in the process. GNN models can take structured information as input directly, which has attracted the attention of researchers [14, 33, 93]. The experimental results in Table 15 show that the GNN does have a better learning ability at slice-level granularity. However, due to the input limitations of the GNN, the code needs to be sliced and processed into a graph structure, and high-quality labels for slice-level data sets are difficult to obtain. As a result, the test performance of a GNN trained on a slice-level data set is not as good as that of other neural networks trained on function-level data sets. SVD based on GNNs still needs further research.
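How a GNN consumes graph structure directly can be seen in one round of message passing, in which each node updates its feature vector from the nodes that point to it. The Python sketch below uses plain mean aggregation with no learned weights or gating, so it is far simpler than the models in [14, 33, 93]; the node names and feature values are made up.

```python
def message_pass(features, edges):
    """One round of mean aggregation: node += average of its in-neighbors' features."""
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(features[src])
    updated = {}
    for node, vec in features.items():
        msgs = incoming[node]
        if msgs:
            agg = [sum(col) / len(msgs) for col in zip(*msgs)]
        else:
            agg = [0.0] * len(vec)   # no in-neighbors, nothing to aggregate
        updated[node] = [v + a for v, a in zip(vec, agg)]
    return updated

# Toy dependence graph: both the declaration and the length flow into the call.
features = {"decl_buf": [1.0, 0.0], "read_len": [0.0, 1.0], "call_memcpy": [0.0, 0.0]}
edges = [("decl_buf", "call_memcpy"), ("read_len", "call_memcpy")]

print(message_pass(features, edges)["call_memcpy"])  # prints [0.5, 0.5]
```

The edges decide which information reaches the call node, whereas a flattened token sequence would leave the model to rediscover those relations from position alone.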

Table 16 Test results of BiLSTM trained based on real-world data set Lin with different embedding methods

Besides, embedding methods also deserve attention. Table 16 shows the test results of the same BiLSTM model trained with different embedding methods. The GloVe and Doc2vec models perform poorly. This is because GloVe focuses on word co-occurrence, and Doc2vec aims to extract sentence and document vectors. However, compared with natural languages, the semantic information of programming languages is mainly contained in the names of functions and variables, so it is not easy for GloVe and Doc2vec to learn semantic features. Similarly, the performance of the CodeBERT method is also reduced. However, because the CodeBERT model is pre-trained on large-scale code, it has learned richer semantic information than GloVe and Doc2vec. In cross-domain detection, as shown in Table 17, CodeBERT achieved the best performance.

On the whole, how to better retain the semantic information of code needs further research.

Table 17 Cross-domain test results of BiLSTM, trained based on real-world data set Lin with different embedding methods, tested based on real-world data set REVEAL

4.5 Vulnerability detection in the cross-environment

Another problem worthy of attention is that the current DL-based vulnerability detection methods have a narrow range of applicability.

One is the cross-programming-language environment. Most studies are limited to detecting some types of vulnerabilities (such as buffer overflow) in code written in a few mainstream programming languages (such as C and Java). Because software source code usually involves multiple programming languages and vulnerability types, the detection ability of these methods cannot meet detection requirements. Zou et al. [19] tried to develop a multi-class vulnerability detection system to cover most classes of vulnerabilities. However, limited by the small number of data samples, the actual detection effect for these vulnerabilities is still not guaranteed. Besides, Li et al. [78] developed a detection system that can detect vulnerabilities in different languages at the same time. However, the model’s design is still based on mainstream languages, so the same detection efficiency cannot be guaranteed in other languages. Therefore, DL-based SVD for multiple programming languages and multiple types of vulnerabilities is worthy of further study.

Table 18 Cross-domain test results of different methods, trained on the real-world data set Lin and tested on the real-world data set REVEAL

The other is the cross-project environment. Due to differences in dependent function libraries, DL-based SVD methods often have low accuracy in cross-project detection [68]. However, cross-project detection is inevitable in practice because the software code to be analyzed usually has no labels [96]. To alleviate this problem, researchers have proposed several cross-project detection methods. For example, Liu et al. [96] learned a cross-project representation by minimizing the distribution difference between the source and target domains to improve cross-project detection efficiency. Furthermore, Sheng et al. [97] attempted to bridge the distribution divergence between source and target projects by combining adversarial learning with discriminative feature learning. However, as shown in Table 18, although these studies have improved cross-project vulnerability detection performance, the loss of accuracy cannot be completely avoided. Therefore, more research on cross-project vulnerability detection is needed to better meet practical detection needs.
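To make the notion of a distribution difference between source and target domains concrete, the following minimal sketch computes a biased estimate of the squared maximum mean discrepancy (MMD) with a Gaussian kernel. MMD is chosen here only as one common discrepancy measure; it is an assumption for illustration and is not claimed to be the measure used in [96] or [97]. The feature vectors are synthetic, and all names (`mmd_squared`, `sample_features`, `gamma`) are hypothetical.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel value between two feature vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(a, b, gamma):
    """Average kernel value over all pairs drawn from samples a and b."""
    total = sum(rbf_kernel(x, y, gamma) for x in a for y in b)
    return total / (len(a) * len(b))


def mmd_squared(source, target, gamma=0.1):
    """Biased estimate of the squared maximum mean discrepancy (MMD)."""
    return (
        mean_kernel(source, source, gamma)
        + mean_kernel(target, target, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
    )


def sample_features(rng, mean, n=100, dim=8):
    """Draw n synthetic dim-dimensional vectors standing in for code representations."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
# Assumption: the source project and an "aligned" target share one distribution,
# while a "shifted" target mimics a project with different library usage.
source_features = sample_features(rng, mean=0.0)
aligned_target = sample_features(rng, mean=0.0)
shifted_target = sample_features(rng, mean=1.5)

mmd_aligned = mmd_squared(source_features, aligned_target)
mmd_shifted = mmd_squared(source_features, shifted_target)
print(f"aligned: {mmd_aligned:.4f}, shifted: {mmd_shifted:.4f}")
```

In a domain-adaptation setting, a term of this kind would typically be added to the classification loss so that the learned representations of source and target projects are pulled closer together; the sketch only measures the discrepancy and does not perform any training.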

5 Conclusion

With the advancement of artificial intelligence technology, software vulnerability detection systems based on deep learning may achieve autonomous and intelligent vulnerability mining, helping to avoid high false-positive and false-negative rates. This research comprehensively evaluates available deep learning-based software vulnerability detection algorithms and examines the perception gap. It also covers the current state of research and trends in deep learning-based software vulnerability detection approaches that address this issue. Finally, it identifies the difficulties and prospects of this field.

We believe that the two most important problems to be solved in this field are the following. The first is the lack of large-scale public data sets, which makes it difficult to compare the various methods in the field under objective conditions. The second is cross-environment vulnerability detection, because a method can be applied to the real world only if its detection results hold in real environments. We hope that more research will focus on solving these problems in the future.