1 Introduction

Given the complexity of modern programs, finding all bugs is an undecidable problem. Some critical bugs are classified as software security vulnerabilities if they can be exploited to gain unauthorised access or privileges on a system. To secure a computer system, vulnerabilities are normally detected by manual verification with the aid of automated tools. Several effective techniques exist, such as static analysis, taint analysis, symbolic execution and fuzzing, and some tools combine two or more of them to achieve better efficiency and precision. Unfortunately, previous research on software vulnerability detection has had little practical impact: precision, scalability and efficiency remain problems for these tools, and most vulnerabilities are still found by fuzzing and manual source code auditing. Recently, with the rise of artificial intelligence research, vulnerability detection using ontology-based services has received increasing attention.

An ontology, in computer science, is an explicit specification of a conceptualisation [17]. It defines a set of representational terms (e.g. classes, relations, functions) to model a domain of knowledge, with definitions that describe the meaning and logical constraints of these terms. Ontologies have been applied successfully in artificial intelligence, especially for information integration, knowledge sharing and semantic understanding. In computer security, ontology services have been shown to help in understanding vulnerability requirements, capturing source code semantics and constructing vulnerability knowledge.

Although no agreed definition of the term knowledge graph has emerged since Google first proposed it in 2012 [14], some researchers have started to use knowledge graph as a generalised term for any knowledge-based representation, sometimes treating it as a synonym for ontology. In this paper, we treated ontology and knowledge graph as synonyms while reviewing articles, in order to give a broader review.

Using ontologies in software security requirement engineering has proved very useful [6]. Human language is ambiguous and many security terminologies are vaguely defined; concrete terms are therefore needed for software security concerns. Security ontologies provide matrices for analysing whether a software product has met its security specifications; for example, they can check whether a user has the right level of permission, or whether application programming interfaces (APIs) carry more information than they should. Many security ontologies have been proposed; Blanco et al. [8] and Souag et al. [27] produced comprehensive surveys of them. However, these software engineering approaches served as hierarchical reasoning and decision-making tools: they managed secure software design, implementation and maintenance, but they could not discover vulnerabilities caused by programming errors, such as buffer overflow vulnerabilities arising when programmers do not validate memory lengths.

In contrast to these software engineering needs, this paper takes a different point of view, focusing on vulnerability detection. Recent research has transformed source code into ontology representations and analysed software to detect vulnerabilities, combining some traditional concepts. Vulnerability ontologies have also been proposed to reason about known vulnerabilities and their consequences.

In this paper, we first give brief background information on popular vulnerability detection techniques and their open challenges (Sect. 2). Next, we introduce how existing studies have used ontologies to detect vulnerabilities (Sect. 3). Then, we discuss future directions of ontology-based services for vulnerability detection (Sect. 4). Finally, we conclude the paper (Sect. 5).

2 Background

Depending on whether execution of the program is required, vulnerability detection techniques fall into two major categories: static and dynamic analysis. Static analysis verifies the program in a formal, mathematical way without executing it. Dynamic analysis instruments the program and monitors its execution.

2.1 Static analysis

Static analysis [7] does not require executing the program. In general, static analysis is a method of pattern matching: it looks for dangerous program constructs and invalid inputs, and it requires experience and knowledge to define rules for vulnerabilities.

Transforming source code into an intermediate representation or model is a very common approach in static analysis. Classic examples are the abstract syntax tree (AST), control flow graph (CFG) and program dependency graph (PDG). These graph representations are very useful in flow-sensitive analysis. They can also be used to generate formal constraints on program behaviour, which can then be solved mathematically by a theorem prover. The finite state machine (FSM) is another popular abstract model for event-driven systems, used to reason about state transitions and their results.
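
For illustration, the sketch below uses Python's built-in ast module to parse a small function and scan its syntax tree for call sites, the kind of construct lookup a pattern-based analyser performs; the code fragment and the "rule" are purely illustrative.

```python
# Minimal AST-based scan: parse source, then walk the tree looking for
# function calls (a stand-in for a "dangerous construct" rule).
import ast

source = """
def copy_input(buf):
    data = input()
    buf[0:len(data)] = data
"""

tree = ast.parse(source)

for node in ast.walk(tree):
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        print(f"call to {node.func.id} at line {node.lineno}")
```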

The first symbolic execution [26] was proposed as a static method. It takes symbolic inputs, which are not concrete (real) values. For each possible execution path, it stores the path condition as a first-order formula. Finally, a satisfiability modulo theories (SMT) solver is used to solve the formulas, yielding input values that satisfy the path condition.
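
As a minimal sketch of this final step, the example below uses the z3-solver Python package to solve a hypothetical path condition and recover a concrete input that drives execution down that path.

```python
# Solving a path condition with an SMT solver, as a symbolic executor would.
# The constraints here are hypothetical; requires the z3-solver package.
from z3 import Int, Solver, sat

x = Int("x")                # symbolic input, not a concrete value
s = Solver()
s.add(x > 10, x * 2 < 30)   # first-order formulas collected along one path
if s.check() == sat:
    print(s.model())        # e.g. [x = 11], a concrete input for this path
```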

With static analysis it is relatively easy to achieve high code coverage, because it does not depend on specific inputs to examine all possible paths. Depending on the abstraction level, and at a cost in precision, static analysis can also be quite scalable. It is a cost-effective means of verifying a program; however, some challenges remain, as addressed below.

  • Precision: Not executing the program forces static analysis to make approximations, and no approximation is perfect: the result is over- or under-approximation. Thus, static analysis produces false-positive and false-negative results. A false positive is an instance where correct behaviour is reported as a bug; a false negative is a bug that is not detected, also known as a miss.

  • Environment calls: Programs are often not self-contained: they call system or third-party library functions whose code cannot be analysed locally, so the interactions between the program and its environment cannot be examined statically.

  • State explosion: A program has a potentially infinite number of inputs, paths and states. Reasoning about all possible states is not feasible within available computational power and time.

  • Difficulty of solving constraints: Solving constraints mathematically can consume a great deal of computational power, especially for nonlinear arithmetic formulas. The capability of the constraint solver is therefore a limitation of static analysis.

2.2 Dynamic analysis

Dynamic analysis monitors the execution of the program. A concrete execution of a program is an actual run, as opposed to the abstraction used in static analysis. Dynamic analysis is precise: if an unexpected behaviour is observed, there must be a bug in that execution. Taint analysis, concolic execution and fuzzing are the three most used dynamic analysis techniques.

Taint analysis [26] tracks the run-time information flow affected by a predefined taint source, such as user input. Taint analysis achieves higher precision because it observes run-time flows, which reduces the approximation problem, but at the cost of coverage.
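
The toy sketch below illustrates forward taint propagation under strong assumptions: real engines instrument the running program, whereas this fragment simply propagates taint over a recorded assignment trace at the variable level.

```python
# Forward taint propagation over a straight-line assignment trace.
tainted = {"user_input"}           # predefined taint source

trace = [                          # (destination, sources) per assignment
    ("a", ["user_input"]),         # a = user_input        -> tainted
    ("b", ["a", "const"]),         # b = a + const         -> tainted via a
    ("c", ["const"]),              # c = const             -> clean
]

for dst, srcs in trace:
    if any(s in tainted for s in srcs):
        tainted.add(dst)

print(tainted)                     # {'user_input', 'a', 'b'}
```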

Concolic execution [10] is the combination of concrete execution and symbolic execution; the word concolic joins the first part of concrete (“conc”) with the last part of symbolic (“olic”). Its goal is to deal with external calls that cannot be examined symbolically. In the process, symbolic expressions are generated alongside concrete runs, so concrete information and symbolic constraints are stored at the same time. It can be used to increase code coverage by reasoning about unexplored paths.
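
The hand-instrumented toy below (again assuming the z3-solver package; it is not a real concolic engine) sketches the core loop: execute concretely, record the branch constraints symbolically, then negate the last constraint to obtain an input for the unexplored path.

```python
# Concolic-style path flipping on a single hand-instrumented branch.
from z3 import Int, Solver, Not, sat

sym_x = Int("x")

def run(x):
    """Concrete run that also records the symbolic path condition."""
    path = []
    if x > 100:                      # concrete branch decision...
        path.append(sym_x > 100)     # ...recorded symbolically
    else:
        path.append(Not(sym_x > 100))
    return path

path = run(7)                        # concrete input 7 takes the else-branch
s = Solver()
s.add(*path[:-1], Not(path[-1]))     # negate the last branch taken
if s.check() == sat:
    print("next input:", s.model()[sym_x])   # e.g. 101, the other path
```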

Fuzzing [11] is also an effective method. It tests a program by continuously generating suites of inputs. The efficiency of a fuzzing system, measured as the number of input samples used relative to the vulnerabilities detected, has always been a problem. Fuzzing is therefore often combined with other techniques (static analysis, taint analysis, symbolic execution, etc.) that guide the generation of test inputs, achieving higher efficiency.
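
A minimal mutation fuzzer sketch follows; the target function stands in for a real program and contains a planted bug, so all names and the "crash" condition are illustrative.

```python
# Mutation fuzzing: randomly flip bytes in a seed input, watch for crashes.
import random

def target(data: bytes) -> None:
    # Hypothetical parser with a planted bug.
    if len(data) > 3 and data[0] == 0xFF:
        raise RuntimeError("malformed header accepted")

seed = b"GOOD"
for _ in range(10000):
    mutated = bytearray(seed)
    pos = random.randrange(len(mutated))
    mutated[pos] = random.randrange(256)      # flip one byte
    try:
        target(bytes(mutated))
    except RuntimeError as e:
        print(f"input {bytes(mutated)!r} triggered: {e}")
        break
```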

Dynamic analysis has a very low false-positive rate, and generating test inputs can be cheap in terms of computational power, depending on the generation strategy. The main challenges are listed below.

  • Efficiency and coverage: Dynamic analysis requires test inputs. Security testers have usually not been involved in the software development phase, so they may lack the domain knowledge to generate effective inputs. The generated inputs then achieve low coverage and do not reach deeply enough to expose hidden vulnerabilities.

  • Challenges inherited from other techniques: When dynamic analysis is combined with other techniques, the challenges of those techniques remain; for example, the path explosion of symbolic execution makes it unscalable to complex systems.

3 Classification and implementation

In this section, we present the existing work on ontology-based services for vulnerability detection. Based on our review, we classified the research into four categories.

  • Source code ontology: Like an intermediate language used in static analysis, some research proposed transforming source code into a knowledge-based representation. Although some of these works were not designed for vulnerability detection, we list them because they give insight into how to transform source code.

  • Ontology-based static analysis: With a program ontology and vulnerability pattern knowledge, static analysis becomes a feasible tool for detecting vulnerabilities.

  • Ontology-based dynamic analysis: Program run-time information can be constructed as an ontology and used for dynamic analysis.

  • Vulnerability ontology: This kind of research focuses more on the knowledge of vulnerabilities. Ontologies can provide knowledge about vulnerability patterns and potential hazards.

The following subsections describe the implementations of the four categories in detail.

3.1 Source code ontology

Understanding what the source code does is critical to software development, maintenance and bug isolation. Traditionally, there are graph representations, such as control flow graphs and program dependency graphs. These graphs are useful in flow analysis, but they lack real-world domain knowledge and functional information.

LaSSIE [12] was proposed by Devanbu et al. in 1991. It was used to explore the relationships between different program components and thereby examine program complexity and structure. The tool was concerned with only four kinds of concepts: OBJECT, ACTION, DOER and STATE. These were general enough to describe a system, and the classification algorithm was cheap and fast to develop. But because LaSSIE was limited to these terms, it could not describe software systems precisely. It could address security concerns but could not detect vulnerabilities.

Yang et al. [29] extracted software knowledge based on four elements: classes, relations, functions and instances, using RWSL [30] as the ontology representation language. The tool was intended to improve understanding of legacy systems and thus ease the re-engineering process.

Zhang et al. [33] introduced the Software Ontology for Understanding (SOUND) in 2006, an Eclipse plug-in that recognised concepts and discovered relationships in software and stored them in description logics (DL). The ontology reasoner Racer [18] was used to reason about security concerns such as object accessibility and exception handling. However, it could not discover vulnerabilities at the source code level.

Hong et al. [21] combined a domain ontology with a software ontology containing a class diagram to generate a new knowledge representation. It was capable of design flaw detection, consistency checking and domain understanding, but, again, this ontology could not detect source code vulnerabilities.

The COMPonent REpresentation ontology (COMPRE) [1] was proposed by Alnusair and Zhao in 2010, with the main objective of identifying reusable software components. It combined a source code ontology, a component ontology and a domain ontology; it conceptualised reusable software libraries and defined common vocabularies to describe software components. Source code and the corresponding ontologies were serialised into Resource Description Framework (RDF) triples, and SPARQL queries were used to search for required components. Since only the defined components were stored in the ontology, vulnerability detection was not feasible.
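
As a hedged sketch of this style of workflow (not COMPRE's actual schema), the Python example below uses the rdflib package to serialise component facts as RDF triples and search them with a SPARQL query; the ex: vocabulary is invented for illustration.

```python
# Serialise component facts as RDF triples, then search them with SPARQL.
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/code#")
g = Graph()

# Describe one reusable component.
g.add((EX.QuickSort, RDF.type, EX.Component))
g.add((EX.QuickSort, EX.implements, Literal("sorting")))

# Query for components providing a required capability.
results = g.query("""
    PREFIX ex: <http://example.org/code#>
    SELECT ?c WHERE { ?c a ex:Component ; ex:implements "sorting" . }
""")
for row in results:
    print(row.c)    # http://example.org/code#QuickSort
```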

Table 1 Summary of source code ontologies

Ganapathy and Sagayaraj [16] also introduced an ontology to identify reusable code in 2011. The framework extracted metadata of completed code and components from application files and folders, then stored it in the Web Ontology Language (OWL). This was an application of integrating source code into the Semantic Web, which could also improve the consistency of code knowledge management. Code extracted from software development and open-source repositories could be reused to speed up development. Like the above ontologies for code reuse, it could not detect vulnerabilities.

CodeOntology [3] was proposed by Atzeni and Atzori in 2017. It generated an ontology and RDF triples linking source code statements, the abstract syntax tree, structural entities and documentation comments. It was then possible to run queries on the ontology for different purposes, such as static analysis and component reuse search. The exact static analysis method was not given in the paper, but the ontology contains enough source code detail. In a later paper, Atzeni et al. [4] further utilised CodeOntology to generate code automatically from natural language: an unsupervised machine learning algorithm retrieved Java methods and code segments from CodeOntology given a natural language description. Their experiments showed high accuracy.

As described above, source code ontologies evolved from simple taxonomies and object extraction to detailed and comprehensive representations. The early prototypes might address security concerns but did not contain enough information to be used for vulnerability detection. A more recent approach, CodeOntology, showed the potential to query for vulnerabilities at the source code level. Table 1 summarises the source code ontologies in terms of their target languages, implementation ontology languages and primary functionalities.

Table 2 Summary of ontology-based static analysis tools

3.2 Ontology-based static analysis

Kirasic et al. [23] proposed a static analysis system for source code design pattern recognition in 2008. The system contained three subsystems: an AST parser that transformed the program's AST into XML form; OWL ontologies for the programming language and code patterns; and an analyser that took the XML document and the predefined ontologies as input to recognise patterns. The system was designed for C#, and the paper gave an example of identifying a singleton class. Because the code design patterns could be expanded, the system had the potential to be used for known vulnerability detection.

Similar to Kirasic's work, BugDetector [31] was proposed by Yu et al. in 2008 for Java programs. The tool consisted of four parts: (1) bug pattern identification, (2) a program specification ontology and a bug pattern ontology, (3) AST generation and program ontology matching, and (4) bug reasoning. Bug pattern identification was done manually by collecting more than 200 Java program bugs and translating them into the Semantic Web Rule Language (SWRL); an ontology of source code and bug relationships was then modelled accordingly and used to reason about bugs at the program AST level. The tool was compared with the well-known static analysis tool FindBugs [5], and the results showed that BugDetector found more bugs on the same test software. This suggests that matching against a bug pattern ontology is more precise than the traditional techniques. However, BugDetector's computation time was much longer than that of FindBugs, and scalability remained a problem.

Yu et al. [32] later introduced, in 2011, a static analysis framework designed specifically for security vulnerabilities, based on their work on BugDetector [31]. The basic flow was similar to BugDetector's, but this time, to reduce the computational cost, the program was reduced by a program slicing technique: instead of reasoning over a whole program, only slices containing the statements of interest were examined. Because only security bugs were considered, the ontology reasoning complexity was further reduced. As a result, the execution time of the new tool dropped significantly with almost no increase in the miss rate.

Paydar et al. [25] proposed a semantic web approach to detecting Java source code design patterns. An RDFizer parser was used to transform source code into an RDF semantic representation, and design patterns were modelled as SPARQL queries, which were then executed on the source code RDF triples. The experiments showed effective results for eight kinds of design patterns (including singleton, composite and decorator) on three Java projects. However, no security patterns were considered.
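
To illustrate the approach, the sketch below expresses a rough singleton pattern (a class with a private constructor and a static field of its own type) as a SPARQL query; the jcode: vocabulary is invented here, and the real RDFizer schema may differ.

```python
# A design pattern modelled as a SPARQL query over source code RDF triples.
SINGLETON_QUERY = """
PREFIX jcode: <http://example.org/java#>
SELECT ?cls WHERE {
    ?ctor  jcode:constructorOf ?cls ;
           jcode:visibility    "private" .
    ?field jcode:declaredIn    ?cls ;
           jcode:isStatic      true ;
           jcode:hasType       ?cls .
}
"""

# Hypothetical usage with rdflib, once the source code has been RDFized:
#   for row in graph.query(SINGLETON_QUERY):
#       print(row.cls)
```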

EkramiFard et al. [15] presented a source code security analysis model based on semantic web techniques, continuing Paydar et al.'s [25] research. Java source code was transformed into RDF triples, and SPARQL queries were used for vulnerability matching. Their test cases found two more kinds of bug patterns than FindBugs.

Table 2 highlights each tool's target programming language, the languages used to construct and query its ontologies, and its field of interest. Ontologies can be used with static analysis methods to match vulnerability patterns and discover vulnerabilities. Traditional pattern-matching static analysis produces a high false-alarm rate if the pattern is too generic, or misses many bugs if the pattern is too specific. Bug ontologies describe bug patterns in a more detailed and consistent way, enriched with domain knowledge, and thus have the potential to discover more bugs with a lower false-alarm rate. However, the computational power required is much higher.

3.3 Ontology-based dynamic analysis

Haider et al. [19] proposed a dynamic analysis technique using ontologies in 2010. Three ontologies were used together for the analysis: (1) a program ontology, (2) a run-time trace ontology and (3) an analysis requirement ontology. The program ontology represented the structure of the program as models and objects. The trace ontology gathered run-time information, including entry and exit methods, initial inputs and method calls. The analysis requirement ontology described requirements regarding the frequency of events.

Although this was called a dynamic analysis, the proposed method did not detect vulnerabilities: it analysed the usage frequency of program components and reasoned about software development and code reusability. Nevertheless, the approach gives insight into how dynamic analysis could be integrated with ontologies.

3.4 Vulnerability ontology

Other research takes a different angle: building ontologies of vulnerabilities, rather than of source code, to reason about common software weaknesses.

Table 3 Summary of vulnerability ontologies

The Ontology for Vulnerability Management (OVM) [22], proposed by Wang et al., captured the relationships between software vendors, programs, vulnerabilities, consequences, countermeasures and other concepts. It was a high-level reasoning and decision-making tool that helped describe threats in a formal and precise way, making it easier to identify known vulnerabilities in software than retrieving them directly from common vulnerability databases such as CVE (Common Vulnerabilities and Exposures) and CWE (Common Weakness Enumeration).

Algahtani [2] introduced an ontology that linked software APIs and vulnerabilities. Modern software is built on libraries and APIs, which are rarely bug free; the goal was to track vulnerabilities and notify developers when an API contains one. The ontology model used a multi-layer hierarchy, including a vulnerability ontology, a software build ontology, system concepts and some general domain knowledge.

Du et al. [13] proposed a knowledge graph to trace links between vulnerabilities and software components. They built ontologies for GitHub projects, Maven repositories and the CVE database, respectively, and then refined the links between GitHub projects and Maven repositories, and between CVE information and Maven repositories, using an ontology matching approach. The matching accuracy between CVE and Maven repositories was as high as 99.88%.

Han et al. [20] proposed DeepWeak in 2018, a knowledge graph of CWE vulnerabilities. They presented a knowledge graph embedding method that combines the textual and structural information of vulnerabilities to embed vulnerabilities and their relations in a low-dimensional vector space. As a machine learning technique, the embedding can learn relationships and consequences that are difficult for humans to discover. By evaluating their vulnerability ontology at different time stamps, they demonstrated its ability to predict potential bugs.

The vulnerability ontologies proposed above consider vulnerabilities and their consequences. They are helpful and accurate in discovering known vulnerabilities, and the DeepWeak tool showed the potential to predict unknown vulnerabilities as well. Table 3 summarises the relations examined by the vulnerability ontologies and the techniques used for their implementation.

4 Future directions

Knowledge-based information management has always been an interesting topic in artificial intelligence research, and many of its advanced techniques could be applied to vulnerability detection to improve its performance.

4.1 Standard ontologies for vulnerability detection

Unlike traditional intermediate representations of source code (e.g. AST, CFG, FSM), which have standard forms and implementation algorithms, an ontology for source code has not yet been discussed and agreed upon by the software community.

Source code ontologies have been shown to be useful for software comprehension, as discussed in Sect. 3. If the granularity is taken down to the method and statement level, the ontologies can give clear attributes such as variable ranges, method inheritance and condition constraints, information that is very useful in static analysis. In traditional analysis, control flow, data flow and method calls are analysed separately on different graphs; an ontology has the advantage of integrating all this information together, thus improving precision. There is therefore a need for a well-defined source code ontology designed specifically for vulnerability detection, to work with existing analysis techniques.

Vulnerability ontologies can also be quite useful for describing vulnerability patterns and properties. However, different researchers gather different vulnerability concepts according to their research interests, and the traditional vulnerability databases (e.g. CVE, CWE) describe vulnerabilities in informal human language, which is not ideal for knowledge sharing. For vulnerability detection, a common, unambiguous and formal representation of vulnerabilities is needed.

4.2 Ontology embedding

As shown in BugDetector [31], the computational power required to match a source code AST against a program bug pattern ontology was a problem. In the continuation of BugDetector, the authors used program slicing to reduce the size of the subject under test. This was effective, but the deletion of source code led to a higher miss rate: in their experimental set-up the miss rate was very low, but it may cause problems for complex systems.

Ontology embedding is a way to represent a graph-based ontology in a low-dimensional vector space. In that space, traditional database search and graph traversal are not required, and vector matching is much faster.

DeepWeak [20] was the only surveyed work that applied ontology embedding. A machine learning model was used to understand the structure and description of vulnerabilities and transform the ontology into a lower-dimensional vector space. The vectors might capture missing relations and consequences not previously identified by humans, and could thus be used to predict vulnerability links and unknown consequences. DeepWeak used a relatively simple model for ontology embedding and achieved good performance. Other embedding approaches, such as TransE [9], TransR [24] and description-embodied knowledge representation learning (DKRL) [28], are more sophisticated and may achieve better performance.
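
As a minimal sketch of the idea behind TransE (with toy dimensions and random vectors, not DeepWeak's actual model), the example below scores a triple by the distance between head + relation and tail; training would adjust the vectors so that observed triples score higher than corrupted ones, which is what enables link prediction.

```python
# TransE-style triple scoring: a triple (head, relation, tail) is plausible
# when head + relation lies close to tail in the embedding space.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
head = rng.normal(size=dim)        # e.g. embedding of a CWE weakness
relation = rng.normal(size=dim)    # e.g. embedding of "canLeadTo"
tail = rng.normal(size=dim)        # e.g. embedding of a consequence

def score(h, r, t):
    """Negative L2 distance: higher means a more plausible triple."""
    return -np.linalg.norm(h + r - t)

print(score(head, relation, tail))
```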

5 Conclusion

This paper surveyed the existing work on ontology-based services for source code vulnerability detection. We discussed key aspects and challenges of current vulnerability detection techniques, then presented the basic design principles of existing ontology services. We found that source code ontologies vary significantly in how they extract information: some could support static analysis, although no specific method was given, and most did not consider bug detection. Some ontology-based static analyses were able to discover more bugs, but at the cost of computational power and scalability. An ontology can also be used dynamically, by monitoring a program and capturing its run-time information. Vulnerability ontologies were accurate in linking known bugs to software components; based on their structure and descriptions, ontology embedding can also be applied to predict hidden vulnerabilities. There are advantages of ontologies that previous research did not exploit, such as machine learning and embedding, so many improvements can be made to existing ontology-based vulnerability detection methods. We hope this survey provides deeper understanding of, and insight into, this field.