Keywords

1 Introduction

With the advent of information society and the popularization of software applications, more security problems of computer are arising from software vulnerabilities. Software vulnerability is weakness in software systems that may cause the application crash or be exploited by a threat to gain unauthorized access to information [1, 2]. So software vulnerability detection has been a research focus of information security in recent years [3]. And various detection approaches are put forward [4, 5].

1.1 Motivation and Contributions

All these existing approaches are falling into three main categories: static, dynamic, and integrated analysis systems. But due to lacking proper theory to accurately describe characteristics of vulnerability, they are imprecise, resulting in a large amount of false positives. Dynamic analysis systems, such as “fuzzers”, can provide conditional inputs. However, they suffer from exhaustive test cases.

Current researches agree that every software vulnerability is caused by some flaws or defects of the software [3]. Most software defects and flaws are parts of software inherent attributes and they always occur regularly [6]. So we believe that software vulnerabilities follow certain patterns and can be identified by them if the patterns are accurately be described [7].

Many years of the research on the vulnerabilities detection make us believe false positives are caused by suspicion or misunderstanding, the both are due to the lack of an accurate formal description of the vulnerability especially for some high speed detecting tools.

In this paper, we propose a vulnerability static detection model by abstracting the characteristics of a variety of vulnerabilities in form of propositional function. We focus on software source code detection and try to formalize patterns of vulnerability. If there is a violation of patterns in a program, there will be software vulnerability. We discriminate and describe a variety of software vulnerabilities formally by this model.

1.2 Related Work

In the following, we briefly review the prior work most closely related to our model in two groups: theoretical approaches and mature tools.

Clarke proposed a formal software vulnerability testing technology which can judge whether a given program meets the pre-defined characteristics or not by traversing the state space [8]. Obviously, there will be a state space explosion when it is used to detect large-scale programs.

Describing vulnerability characters appropriately is a critical step for its detection. Wilander proposed a generic way to model the security characteristics of codes by vulnerability dependency graph [9].

B. Liang and K.K. Hou proposed an expanded finite-state machine model which can traverse the possible executable paths in a program statically and identify the current operation. This model reduces the false positives to some extent, but all possible executable paths in the program need to be traversed, so the detect efficiency still needs to be improved [10].

Compass is a static analysis tool for checking source code designed by ROSE Team [11]. It does not describe the characteristics of vulnerabilities in depth.

Some lightweight approaches include Rats [12], Prefast [13] as well as Splint [14], they can not find deep layer vulnerabilities and also require manual annotations. FindBugs is a static analysis tool to find defects in Java code but not a style checker.

There are some other tools, like Coverity, Fortify, CodeSonar, and IBM Security AppScan Source (formerly Rational). Due to the auto scanning, those tools can make thorough analysis with configurable rule sets. Lack of formal description of vulnerability, thorough scanning need long run time and the false positive rate is still high [15].

2 Static Detection Model Based on Propositional Function

CFG (Control Flow Graph) and PDG (program dependency graph) are two important useful data structures for program static analysis [16, 17]. A CFG is a directed graph that shows all paths might be traversed through a program during its execution, whose edges represent possible flow of control between statements [18]. Amed at describing vulnerability conveniently by the propositional function, we define the related concepts in CFG and PDG at first.

2.1 Related Definitions in CFG

Let n\(_i\) and n\(_j\) be two nodes on CFG:

Definition 1

In a CFG, if there is sequence \(p=\,{<}n_0,...,n_m{>}\) which meets (\(n_{i-1},n_i\)) \(\in \) E, where \(i = 1,2...m.\) Then there is an executable path between \(n_0\) and \(n_m\), denoted by \(EP(n_0,n_m)\). The set of all the executable paths in program denoted by EP.

Definition 2

If there is an executable path \(EP(n_i,n_j)\) between \(n_i\) and \(n_j\), then \(n_i\) is the predecessor node of \(n_j\), denoted by \(Pred(n_i,n_j)\); \(n_j\) is the successor node of \(n_i\), denoted by \(Succ(n_i,n_j)\). Let n be a sentence, and the set of all its predecessor nodes called the precursor node set of n, denoted by Pred(n). The set of all its successor nodes is called the successor node set of n, denoted by Succ(n).

Definition 3

In a CFG, \(n_i\) is post-dominated by \(n_j\) if every directed path from \(n_i\) to Exit(not including \(n_i\)) contains \(n_j\), denoted by \(PD(n_j,n_i)\). It should be noted that \(n_i\) is not the post-dominator of itself. Let n be a sentence, and the set of all its post-dominator is called the post-dominator set of n, denoted by PD(n).

Definition 4

There is an executable path \(EP(n_i,n_j)\). \(n_j\) is data dependent on \(n_i\), denoted by \(DD(n_j,n_i,v)\) if

  1. (1)

    there is a variable v, the value of v at \(n_i\) has been used during execution of \(n_j\).

  2. (2)

    v is not redefined on \(EP(n_i,n_j)\).

Definition 5

There is an executable path between \(n_i\) and \(n_j\). \(n_j\) is control dependent on \(n_i\), denoted by \(CD(n_i,n_j)\) if

  1. (1)

    each node on \(EP(n_i,n_j)\) from \(n_i\) to \(n_j\) (except \(n_i\) and \(n_j\)) is post-dominated by \(n_j\).

  2. (2)

    \(n_i\) is not post-dominated by \(n_j\).

2.2 Vulnerability Static Detection Model Based on Propositional Function

Following the above definitions, we can construct our detection model based on propositional function.

Definition 6

The detection model is defined as a five-tuple denoted as \(Vulnerability=\{ n_0, F, S, P, Q\}\). It includes the vulnerability initial nodes set, program state space, Vulnerability Syntax Rules, preconditions of vulnerability, and post-conditions of vulnerability. The followings are the detailed description of the five-tuple.

Vulnerability initial nodes set \(n_0\). \(n_0\) is the initial characteristic nodes of vulnerability which is the entrance node of vulnerability detection. For a program M, its sentence is finite. So, the vulnerability initial nodes set \(n_0\) is finite and certain.

Program state space F. F is the program state space extracted from source code, CFG and PDG. It contains the EP in program, control dependency and data dependency among nodes. F is an Intermediate Representation which contains all necessary information for vulnerability detection, and it can not be empty.

Vulnerability Syntax Rules set S. S is a set of vulnerability syntax rules which are state transition rules between vulnerability initial nodes set \(n_{0}\) and vulnerable nodes set N on EP.

Precondition P. P is Precondition which means that any node n(\(n\in N\)) must meet these state conditions before executing, where N is the set of nodes related to vulnerabilities. Otherwise, there will be a vulnerability.

Post-conditions Q. Q is Post-conditions which means that any node n in N must meet these rules after executing. Otherwise, there will be a vulnerability.

With the model above, the complete process of vulnerability detection can be described as . This process has two steps:

Step 1. Locate vulnerabilities roughly. means that we find the vulnerable node which conforms to the Vulnerability Syntax Rules set S from \(n_0\) on EP. For any EP\(_i\)(EP\(_i\) \(\in \) F), if there is a node \(n_1\) conforms to , \(n_1\) is a vulnerability related node, \(n_1\)\(\in \) N.

Step 2. Locate vulnerabilities precisely. \(\{\)P\(\}\)N\(\{\)Q\(\}\) means that we detect the vulnerable nodes set N by Precondition P and Post-conditions Q. For any \(EP_i\)(EP\(_i\)\(\in \) F), if there is node \(n_2\) conforms to before \(n_1\) executing, and there is node \(n_3\) conforms to after \(n_1\) executing, the detection result is TRUE, and \(n_1\) does not have a vulnerability. Before \(n_1\) executing, if \(\lnot \forall n_{2}\) conforms to , or after \(n_1\) executing, \(\lnot \forall n_{3}\) conforms to , detection result is FALSE and \(n_1\) has a vulnerability.

Next we will use propositional function to describe some types of software vulnerabilities.

2.3 Formal Description of Software Vulnerability Based on Propositional Function

In this paper, we focus on describing and detecting nine software vulnerabilities in four types with CWE number which are the most prone to general programs, as shown in Table 1. Before formulating these software vulnerabilities, we also need some definitions in the form of propositional function.

Table 1. Software vulnerabilities

Definition 7

The way to use variable v can be described as definition-use-check relationships. DEF(vn) means the statement, definition or assignment of v at sentence n; USE(vn) means v is used or cited on node n; CHECK(vnStatement) means detect the statement of v on node n. For example, CHECK(vnNull) means detecting whether the statement of v on n is Null or not, and the check result will be True or False.

Definition 8

The type of a parameter in program M can be described by corresponding propositional functions. For example: pointer variable \(v=\{v|\exists v\in M,\) type of v is Pointer\( \}\) is denoted by Pointer(v); function f is denoted by Function(f), etc.

Definition 9

Use the ResourceAllocateFunctionList to denote the function set related to resource allocation. In C Programming Language the common resource allocation functions are malloc(), fopen(), calloc(), new(), etc. The ResourceAllocateFunctionList(n) denotes resource allocation functions on node n, abbreviated as RAF(n). ResourceRelease(n) means to release resources related to resource allocation functions RAF(n) on node n, abbreviated as RR(n).

Definition 10

The format functions are denoted by FormatFunctionList. In C Programming Language common format functions include printf(), strncpy(), fwprintf(), snwscanf(), fprint(), printf(), etc. FormatFunction(n) means format function which is called on node n, abbreviated as FF(n).

Definition 11

The buffer related functions are denoted by BufferFunction(n). The common buffer related API functions include memcpy(), strcpy(), sprint(), vsprintf(), gets(), scanf(), strcat(), etc.

Definition 12

Propositional function CallFunction(n) means the information of functions called on node n.

Definition 13

Propositional function SharedResource(vn) means shared resources on node n in program v. Propositional function SharedResource(v) means the set of all shared resources in program v.

Definition 14

Propositional function \(IsIn(n_{1}, n_{2})\) means \(n_{1}\subseteq n_{2}\) and propositional function \(\lnot IsIn(n_{1}, n_{2})\) means \(n_{1}\varsubsetneq n_{2}\).

With these formal definition we can present formal propositional function for software vulnerability. We summarize the characteristics of these vulnerabilities and achieve its propositional function.

(1) Null Pointer Dereference. For a target program M, the set of sentence n which defines or declares pointers in M is vulnerability initial nodes set denoted by n\(_0\). Its propositional function is:

$$\begin{aligned} n_{0}=\{n| \exists n \in M \wedge DEF(Pointer(v),n)\}. \end{aligned}$$
(1)

On any executable path \(EP_{i}(EP_{i}\in EP)\), if there is a successor node \(n_1\) of \(n_0\) which calls Pointer(v) and is data dependent on \(n_0\) with Pointer(v), \(n_1\) is a vulnerable node s. Its propositional function is:

$$\begin{aligned} S=Succ(n_{1}, n_{0}) \wedge DD(n1,n0,Pointer(v)) \wedge USE(Pointer(v),n_{1}). \end{aligned}$$
(2)

On this executable path \(EP_{i}(EP_{i}\in EP)\), if there is a node \(n_2\) which is data dependent on \(n_0\) with Pointer(v) and vulnerable node \(n_1\) is control dependent on \(n_2\), and Pointer(v) is Null on \(n_2\), \(n_1\) does not have any vulnerability. Otherwise, \(n_1\) has vulnerabilities. Its propositional function is:

$$\begin{aligned} P=DD(n_{2},n_{0},Pointer(v)) \wedge CD(n_{2},n_{1}) \wedge CHECK(Pointer(v),n_{2},NotNull), \end{aligned}$$
(3)

Q on NPD is Null.

(2) Buffer overflow. For a target program M, the set of sentence n which calls the buffer related functions is vulnerability initial nodes set denoted by \(n_0\). Its propositional function is:

$$\begin{aligned} n_{0}=\{ n| \exists n \in M~ wedge CallFunction(n) \subseteq BufferFunctionList\}. \end{aligned}$$
(4)

On any executable path \(EP_{i}(EP_{i}\in EP)\), if \(n_0\) is data dependent on \(Buffer(v_{1})\) which is defined on the predecessor node \(n_1\) of \(n_0\), \(n_0\) is a vulnerable node. Its propositional function is:

$$\begin{aligned} S=DD(n_{0},n_{1},Buffer(v_{1},n_{0}))\wedge DEF(Buffer(v_{1}),n_{1}) \wedge Pred(n_{1},n_{0}). \end{aligned}$$
(5)

On this executable path \(EP_{i}(EP_{i}\in EP)\), if there is a node \(n_2\) that \(n_0\) is control dependent on, and \(n_1\) is the postdominator of \(n_2\), and both buffer size and input data length are matching, \(n_2\) does not have vulnerability. Otherwise, \(n_2\) has vulnerabilities. Its propositional function is:

$$\begin{aligned} \begin{aligned} P=&CD(n_{2},n_{0}) \wedge PD(n_{2},n_{1}) \wedge CHECK(buffer(v_{1}),input(v_{2}),n_{1},Size) \cup \\&CHECK(buffer(v_{1}),input(v_{2}),n_{0},Size), \end{aligned} \end{aligned}$$
(6)

Q on Buffer Overflow is null.

(3) Uncontrolled Format String. For a target program M, the set of sentence n which calls FormatFunctionList is vulnerability initial nodes set. Its propositional function is:

$$\begin{aligned} n_{0}=\{n|\exists n \in M \wedge CallFunction(n)\subseteq FormatFunctionList\}. \end{aligned}$$
(7)

On any executable path \(EP_{i}(EP_{i}\in EP)\), if \(n_{0}\) is data dependent on variable v defined on its predecessor node \(n_{1}\), \(n_{0}\) is a vulnerable node. Its propositional function is:

$$\begin{aligned} S=Succ(n_{1},n_{0}) \wedge DEF(v,n_{1})\wedge DD(n_{0}, n_{1}, v). \end{aligned}$$
(8)

On this \(EP_{i}\), if both the type and the number of parameters in a format string function are matching on node \(n_{0}\), the result is TRUE, which means \(n_{0}\) does not have vulnerability. Otherwise, \(n_{0}\) has vulnerabilities. Its propositional function is:

$$\begin{aligned} P=CHECK(FF(n_{0}), n_{0}, Parameter), \end{aligned}$$
(9)

and Q is null.

(4) Resource Related Flaws. In target program M, the set of sentence n which defines or declares a variable v belonging to resource allocation functionlist is a vulnerability initial nodes set \(n_{0}\). Its propositional function is:

$$\begin{aligned} n_{0}=\{n|\exists n\in M \wedge DEF(v,n)\subseteq ResourceAllocateFunctionList\}. \end{aligned}$$
(10)

On any executable path \(EP_{i}\), if node \(n_{1}\) which is the successor node of \(n_{0}\) calls resource allocation functions on \(n_{0}\) denoted by \(RAF(n_{0})\) and is data dependent on \(n_{0}\) which is the precursor node of \(n_{1}\) with \(RAF(n_{0})\), \(n_{1}\) is a vulnerable node. Its propositional function is:

$$\begin{aligned} S=USE(RAF(n_{0}),n_{1}) \cup Succ(n_{1},n_{0}) \wedge DD(n_{1},n_{0},RAF(n_{0})). \end{aligned}$$
(11)

On this \(EP_{i}\), if there is not resource release operation \(RR(n_{2})\) corresponding resource allocation functions \(RAF(n_{0})\) on node \(n_{2}\) which is the precursor node of \(n_{1}\),the precondition P is True. Its propositional function is:

$$\begin{aligned} P=Pred(n_{2},n_{1}) \wedge IsIn(RR(n_{2}),RAF(n_{0})). \end{aligned}$$
(12)

On this \(EP_{i}\), if there is node \(n_{3}\) which is the post-dominator of \(n_{1}\) and is data dependent on its predecessor node \(n_{0}\) with \(RAF(n_{0})\), and there is resource release operation \(RR(n_{1})\) or \(RR(n_{3})\) corresponding \(RAF(n_{0})\), the post-conditions Q is True. Its propositional function is:

$$\begin{aligned} \begin{aligned} Q&=PD(n_{3},n_{1}) \wedge DD(n_{3},n_{1},RAF(n_{0})) \wedge (IsIn(RAF(n_{0}),RR(n_{3})) \\&\cup \, IsIn(RAF(n_{0}),RR(n_{1}))). \end{aligned} \end{aligned}$$
(13)

We must consider the case that if \(n_{0}\) can not deduce \(n_{1}\) by the vulnerability syntax rules set S, the variables or functions defined on \(n_{0}\) belong to redundant code.

3 Detection Algorithm

Based on this model, we design a static detection process for software vulnerability analysis, as shown in Fig. 1. It includes: basic information analysis and rules.

3.1 Basic Information Analysis

The basic information analysis module is used to generate and extract some basic static information from target program, as shown in Fig. 1.

Fig. 1.
figure 1

Basic processing module flow chart

Firstly, use lexical and syntax analysis to extract vulnerability initial nodes set \(n_0\) from target program source code. Secondly, use compiler front-end (such as GCC, java compiler) to generate abstract syntax tree AST, and construct CFG and CG (call graph).

3.2 Solution of Vulnerable Nodes Set N

Vulnerable nodes set N is a set of nodes which may contain vulnerabilities. Solving N is the coarse locating process of vulnerability analysis which we described in Sect. 3.2. Steps of Solving N are as follows:

Step 1. Search the program state space F starting from vulnerability initial nodes set \(n_0\).

Step 2. Insert the nodes which conform to vulnerability syntax rules S into vulnerable nodes set N.

Algorithm 1 is shown as follow:

figure b
figure c

3.3 Vulnerability Syntax Rules Database

Vulnerability rules database is a database which includes vulnerability syntax rules set S, vulnerability preconditions P and vulnerability post-conditions Q. It also contains some API functions, such as ResourceAllocation FunctionListFormatFunctionListBufferFunctionList, and so on.

According to Definitions 25 in Sect. 2.1, Definition 6 in Sect. 2.2 and Definitions 7, 12, 14 in Sect. 2.3, we summarize the vulnerability syntax rules for SPQ in Table 2.

Table 2. Vulnerability syntax rules SPQ of the four types of vulnerability
Table 3. Verify and confirm the vulnerabilities disclosed by NIST

4 Experiments and Evaluation

To evaluate the effectiveness of our approach, we proceed to evaluate our method by carrying on experiments from Sep., 2013 to May, 2016 on 4 open resource software and contrast of detection on CWE-476(Tomcat4.0) of our method with FindBug 3.0.1.

We have verified and confirmed the vulnerabilities of 4 open source projects which was disclosed by NIST, shown as Table 3.

As shown in Table 3, the open source projects that we test covering all types of vulnerabilities we describe in this paper. Results show that resource operations flaws, null pointer dereference and format string have high detection accuracy. And buffer overflow has low false positives rate and false negatives rate. Therefore, the method that we propose can be applied to detect most of vulnerabilities.

A Java project tomcat 4.0 is a real software system widely used as serverlet container, whose vulnerabilities are also disclosed by NIST. Here we use our method to detect \(Null\ Pointer\ Dereference(CWE\)-476) and compare the result with that of FindBugs 3.0.1. Although FindBugs has been around for a long time, due to its universality and openness, we select the newer FindBugs 3.0.1 as the experiment tool. The result is shown in Table 4.

Table 4. Null Pointer Dereference (CWE-476) detection results contrast on Tomcat 4.0 between FindBugs 3.0.1 and FOLB \(\hat{}\) EPS

The result shows that the detection rate of our method is higher than that of FindBugs and the False positives rate is also lower than that of FindBugs.

5 Conclusions and Future Work

In this paper, we proposed a static vulnerability detection model based on propositional function. Firstly, we defined and described the existing preconditions, characteristics and properties of vulnerabilities, and gave corresponding discriminant formula in terms of propositional function. Then constructed our detection model with a five-tuple. Then we designed a static detection process according to the new model, and used propositional function to described four types of disclosed software vulnerabilities in CWE. Finally, we carried out experiments, to verify that our model based on the propositional function realized more accurate description of vulnerability, improved detection rate and reduced the false alarm rate.