
1 Introduction

Modern software typically must be able to interact with the whole world. While until recently such worldwide interaction was quite rare, nowadays almost any business software has a web interface that allows anyone to interact with the software, be it only by entering something into username/password fields that are openly accessible to anyone.

Since almost anyone now writes software exposed to the resulting security threats, one can no longer rely on the high skill and experience of the specialists who formerly were the only ones having to deal with such risks. To address this issue, programming guidelines and best practices have been developed, see e.g. www.owasp.org, that summarise and condense the expert knowledge and make it available to a larger community (secure coding) [30]. Whether or not such programming guidelines are applied, and whether they have been applied correctly, is however left to the good will of the programmers.

For this reason, we have proposed the project GuideForce (funded by the DFG since 11/2014), in which we want to develop automatic methods based on type systems that are capable of checking that programming guidelines have been correctly and reasonably applied, without compromising the flexibility of writing code. Besides further developing type system methodology, this also requires us to devise a formalism to rigorously define such policies, which are typically given in plain English and by examples. In order that users actually trust the system and perceive it as a useful tool, it will be necessary to achieve a rather high degree of accuracy. For example, if an already sanitized user input is stored in a string buffer and later read out, it is not necessary to re-sanitize it; if the system does not recognize such a situation, users will neglect its warnings in the future. Similarly, consider a guideline requiring appropriate authorization prior to accessing sensitive data: if such access happens within a method of a class admitting the invariant that authorization has taken place prior to the creation of any of its objects, then the system must be able to discover this. All this means that cutting-edge techniques, such as careful analysis of strings, objects, and control flow, must be harnessed and further developed in this project. In order to guarantee appropriate feedback to the user and to achieve seamless integration, we will use type-theoretic formulations of these methods, resulting in a single customizable type system capable of enforcing a large span of guidelines and best practices for secure web programming.

In [15], we developed a type-based string analysis for the enforcement of guidelines to detect code injection attacks, and a tool has been implemented using OCaml [7].

For GuideForce, however, we plan not to rely on a standalone implementation, but rather to build our tools on top of the Soot framework or one of its competitors. Soot is a Java-based program analysis framework developed by the Sable research group at McGill University [6, 23, 39]. To gain some experience we developed a small pilot study in Soot, which we describe in Sect. 4 below. Building on Soot will allow us to profit from its front end and to base our analyses on intermediate code. We also hope that the use of a well-established framework will ease maintenance and changes in the development team. Furthermore, we will be able to profit from Soot’s built-in fixpoint engines for dataflow analysis. While our analyses will mostly be type-based, it is well known that type inference can be understood as a dataflow problem, see e.g. [29].

The key scientific innovations of the project are the focus on guidelines rather than risks and the development of a configurable type system. In Sect. 3 we describe the GuideForce project in more detail and provide examples of guidelines to illustrate our approach. In Sect. 4 we describe our pilot study, a taintedness analysis for the prevention of XPath injection developed on top of Soot.

We stress that the purpose of this project-start paper is not so much to describe finished results, but to motivate and advertise the goals of our recently begun project so as to create opportunities for interaction.

Human orientation. The focus on guidelines and the use of type systems are thus what distinguishes our approach from most existing ones. While still maintaining high accuracy and efficiency, the traditional objectives in static analysis, this adds a new human-oriented component to the field. Types present even the intermediate results of a static analysis in a concise yet human-readable manner and also allow, in the form of manually provided type annotations, for interaction with the analysis process. Guidelines, on the other hand, have been designed by human experts with human users in mind. Enforcing guidelines not only helps to prevent vulnerabilities and attacks, but also has a forensic and thus human-oriented component: if a programmer or subcontractor can show that all guidelines have been followed, they might not be held responsible for a possible unforeseen attack.

2 Related Work

The use of formal approaches to guarantee adherence to a secure coding guideline, rather than to directly prove the absence of vulnerabilities, is a recent idea; the only directly related work we are aware of is [8], where guidelines are formalised as temporal logic formulas. The novel aspects of our approach in comparison with [8] are the precise tracking of control flow, even interprocedurally, the analysis of string values as required to model guidelines for preventing code injection attacks [9, 15], the tracking of class invariants, and the presentation as a type system akin to the Java type system, which will allow for meaningful feedback to the user.

There is a huge body of literature on the use of formal methods, in particular type systems, for directly preventing attacks rather than enforcing a guideline. The methods used there are also relevant for us; indeed, without the progress made in those areas over the last ten years, a project like ours could not be carried out. In the following summary we focus on our running example, namely the prevention of code injection attacks.

Test-based techniques [20] identify points in an application that are subject to code injection attacks, and build attacks that target such points using knowledge about typical attack patterns. Black-box testing in particular does not make any assumption about the implementation of an application. By design, such methods in general cannot guarantee the absence of attack possibilities; rather, they aim to decrease their number by applying reasonable testing heuristics.

Another approach to prevent code injection is to monitor or instrument a program. The goal is to raise an exception if an action is detected that does not conform to a certain policy. Since it is in general hard to decide which parts of generated code are intended by the programmer and which are injected by the user, the program is either instrumented to mark user-dependent parts [37] or combined with a static analysis [16].

The most interesting approach for us is to statically analyse a program in order to verify that code injection attacks cannot occur during the execution of a program [9, 21, 26, 41, 42]. These approaches usually rely on string analysis to compute the possible values that a string variable can take during runtime, or on taint analysis to determine the origin of a string object, or most commonly on a combination of both.

The work we have referred to so far comes mainly from academia. On the industry side, many static analysis tools that help with secure web programming have been developed, such as CheckMarx [1], AppScan [4], Coverity [2] and Fortify [3]. These tools detect vulnerabilities in various languages (e.g. C, Java) and also claim compliance, up to different levels, with guidelines offered by institutions such as OWASP, MISRA, SANS, and Mitre CWE. Despite such claims in their data sheets, how these commercial tools formalize the guidelines differs from one tool to another, and the common practice is to hardwire a given guideline into the tool. One of our goals is to develop a configuration mechanism to formalize the guidelines, so that they are written separately from the GuideForce source code and are open to independent review by experts.

Furthermore, such a configuration mechanism will make the design parametric with respect to guidelines. A developer will only need to provide a configuration file when analyzing their code, and will be able to check the code against as many guidelines as they want at any point of the software development cycle. We also aim at bringing guidelines, rather than vulnerabilities, into the focus of academic study.

Type systems. One may note that none of these methods provides a formal guarantee. While guaranteeing the complete absence of vulnerabilities is not feasible one can very well guarantee that a particular guideline has been enforced. Type systems [32] are particularly apt for establishing rigorous guarantees for they draw a clear distinction between the declarative statement of a program property by means of a typability relation, and the automatic verification of the property using a type inference algorithm. Type systems have been successfully used not only to prove data type safety, but also to enforce security requirements such as non-interference properties used in information flow security [24, 25, 27, 36]. Of special interest are also the recent works on providing strong typing guarantees for scripting languages such as [17, 18]. In another recent work [19], the authors define a type system along with an inference system and implement a type-based taint analysis on top of Checker Framework to analyze Java web applications.

One may note that strong type systems are used in mainstream languages such as Java and C#, and play a central role in new programming languages such as Scala and F#, which come originally from research and are now beginning to be used in the productive sector. One may thus argue that programmers are familiar with the form of feedback offered by types.

An early work proposing the use of type systems in the context of secure programming is [40] where a type system is used to check runs of functional programs against a security automaton [34].

Human-oriented approaches. Several researchers have already pointed out that the benefits of static analysis go beyond finding bugs in source code. In [28], various applications of static analysis other than finding bugs, such as code review and computing human-readable properties of software, were outlined. The Agile Manifesto [5] emphasized, among other things, that the focus should be on individuals and interactions. In [14], the authors give a detailed explanation of static analysis as part of code review, explain secure programming guidelines (for web applications etc.) and provide exercises with the Fortify code analyzer [3]. The industrial tools mentioned earlier in this section, including Fortify, allow user interaction; the developer receives highlighted source code in a human-readable format.

3 Our Method

In this section we sketch what the final outcome of this project might look like. The workflow of our system is depicted in Fig. 1.

Fig. 1. GuideForce (Type-Based Guideline Enforcement)

The input to the GuideForce system (Type-Based Guideline Enforcement) consists of two parts: first, an annotation file formalising a guideline, which contains instrumentation information and an automaton (“guideline automaton”) expressing the required behaviours; and second, a source program in Java (or an appropriate subset thereof). The program is then automatically annotated according to the instructions in the annotation file and subjected to inference of effect types, the effects being derived from the guideline automaton. We note that automata have been used in the past to describe program properties in order to prevent bugs, e.g. in the SLAM project from Microsoft [11] and the Metal project from Stanford [13], both of which are used to analyze C code. The difference in our approach, however, is that we construct type and effect systems from the automata in the subsequent phases. This process results in a list of constraints, typically over finite sets, whose solutions correspond to valid typing derivations. If a solution can be found, this implies that the source program definitely abides by the formalised guideline. If no solution can be found, the program might violate the guideline, and the reason for this belief can be clearly pinpointed in the form of a typing error that is reported to the user.

To illustrate the format and the workflow of our approach we now describe two simple concrete instances. The first one, about sanitization of user input, is modelled after [15] and related to the prevention of SQL injection attacks.

Example 1 (Sanitization). In order to avoid code injection, programmers are required to sanitize strings that are to be output through a browser window by calling appropriate framework methods that remove potentially dangerous components such as JavaScript code or dangerous URLs. Strings that stem from the user are regarded as potentially dangerous and in need of sanitization, whereas string literals contained in the program are regarded as safe. The guideline in plain English is as follows:

String output must be sanitized.

The following code gives an example which follows this guideline.

figure a

Output strings within tags <script> and </script> must be sanitized by calling the framework method escapeToJs and those within tags <html> and </html> must be sanitized by calling the framework method escapeToHtml.

We decorate strings with labels describing their origin, whether they contain tags like <script>, </html>, etc., and any sanitization they have undergone.

Given this (imaginary) instrumentation we can then specify the guideline by the following automaton:

Here e2JS tags the output of escapeToJs and e2Html tags the output of escapeToHtml. String literals \(\langle script\rangle \) and \(\langle /script\rangle \) are mapped to Script and /Script respectively. All other strings are tagged as Lit. It is not hard to see that in the above code snippet, the tags of strings to be output will be \(\mathtt {Script \cdot e2JS \cdot /Script}\) and \(\mathtt {Lit\cdot e2Html \cdot Lit}\) respectively. They are accepted by the above automaton. This information is supplied by users and stored in the annotation file whose details we omit here for lack of space.
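The guideline automaton itself appears only as a figure. As an illustration, one plausible encoding of such an automaton (the state and tag names below are ours, not from the original annotation file) keeps track of whether we are inside a `<script>` region, where only e2JS-sanitized output is allowed:

```java
import java.util.List;

// Hypothetical sketch of the guideline automaton of Example 1.
// Outside <script> regions, literals and e2Html-sanitized strings are
// allowed; inside, only e2JS-sanitized strings are.
class GuidelineAutomaton {
    enum Tag { LIT, E2HTML, SCRIPT, END_SCRIPT, E2JS }
    enum State { OUTSIDE, IN_SCRIPT, REJECT }

    static State step(State s, Tag t) {
        switch (s) {
            case OUTSIDE:
                if (t == Tag.LIT || t == Tag.E2HTML) return State.OUTSIDE;
                if (t == Tag.SCRIPT) return State.IN_SCRIPT;
                return State.REJECT;
            case IN_SCRIPT:
                if (t == Tag.E2JS) return State.IN_SCRIPT;
                if (t == Tag.END_SCRIPT) return State.OUTSIDE;
                return State.REJECT;
            default:
                return State.REJECT;
        }
    }

    // Accept iff the run ends in OUTSIDE, i.e. every <script> region
    // was closed and contained only e2JS-sanitized output.
    static boolean accepts(List<Tag> tags) {
        State s = State.OUTSIDE;
        for (Tag t : tags) s = step(s, t);
        return s == State.OUTSIDE;
    }
}
```

Under this encoding, the two tag sequences from the code snippet, Script · e2JS · /Script and Lit · e2Html · Lit, are accepted, while for instance Script · Lit · /Script is rejected.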

Example 2 (Authorization). Let us consider the following guideline:

Any access to sensitive data must be done after authorization.

This guideline requires the programmer to call the function auth before any access. The following code fragment follows the guideline:

figure b

Assuming that any successful call to the framework method auth issues an Auth event and that every call of framework method access issues an Access event we can specify the policy with the following finite state machine.
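The state machine figure is not reproduced here; behaviourally, its language can be sketched as a checker over event traces (the names below are ours): an Access event is only legal once at least one Auth event has occurred.

```java
import java.util.List;

// Hedged sketch of the authorization policy of Example 2 as a checker
// over event traces: conforms() returns false exactly when an ACCESS
// event occurs before any AUTH event.
class AuthPolicy {
    enum Event { AUTH, ACCESS }

    static boolean conforms(List<Event> trace) {
        boolean authorized = false;
        for (Event e : trace) {
            if (e == Event.AUTH) authorized = true;
            else if (e == Event.ACCESS && !authorized) return false; // violation
        }
        return true;
    }
}
```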

The two guidelines formalised here are only examples; guidelines are by no means restricted to sanitization and authorization. Other examples include appropriate initialisation and finalisation of data structures and resources, appropriate bounds checking, integrity checking of downloaded code, execution with appropriate privileges, etc.

The Soot-based implementation lets us produce a human-readable output file from a given source file. Using Soot’s built-in visualization tools, we will be able to reproduce the source code with highlighted lines where the (formalized) guidelines might be violated. Furthermore, we plan to provide the user with a detailed explanation of the guideline and of how the highlighted lines violate it. This mechanism thus increases the readability and understandability of the analysis output for human developers.

4 Taintedness Analysis for XPath Injection

To see whether Soot fits our purpose, we decided to use it to develop an analysis for a very simple but still meaningful guideline. For this purpose, we selected a guideline from OWASP [31] to prevent XPath injection, a type of code injection targeting XPath queries, which provide structured access to the nodes of an XML document tree. In this simple example the use of types was not necessary; instead we employed a rather straightforward taintedness analysis.

In the rest of this section, we first explain a sample XPath injection attack, and then explain our implementation within the flow analysis framework provided by Soot. Our analysis is twofold. First, we develop an intraprocedural analysis of method bodies based on the forward flow analysis provided by Soot. Second, we extend the intraprocedural analysis to an interprocedural analysis of the Java classes and methods of an application that uses XML documents to store sensitive data.

In fact, our interprocedural analysis can be seen as an instance of the summary-based (also called functional) approach introduced by Sharir and Pnueli [35]. An efficient framework based on [35] is IFDS, given by Reps, Horwitz and Sagiv [33], which rests on graph reachability. Soot has an extension, called Heros, that implements the IFDS framework; however, we preferred to implement our own interprocedural analysis directly within Soot, since this is considerably simpler and, as we believe, will also facilitate the later extension to more complex type-based disciplines.

We also remark that recently several tools for taint analysis have appeared, notably Andromeda [38] for Java programs and Flowdroid [10] for Android applications.

4.1 XPath Injection Example

Similar to other types of code injection, an attacker enters (malicious) strings such that the resulting XPath query grants unintended access to XML data. We reproduce here the Java-based example given in the OWASP XPath injection page [31]. For more general information and other possible attacks and countermeasures, see [22]. According to the scenario in the OWASP example, an application includes the following XML document to keep personal data:

figure c

An XPath query to select the nodes in the XML document is:

figure d

The example uses the following lines of code to embed this query into source code to select nodes from the document:

figure e

This implementation is vulnerable to attacks that exploit XPath syntax and allow attackers to enter malicious input. For instance, a user could enter ' or '1'='1 into the field employeeID. Then, the query formed by the application amounts to

figure f

which in turn returns all the records in the XML document.
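The OWASP listing itself is not reproduced above. A minimal sketch of the vulnerable pattern, i.e. embedding raw user input into the query by plain string concatenation (the helper name buildQuery is ours, and the element names follow the OWASP scenario), is:

```java
// Hypothetical reconstruction of the vulnerable query construction:
// unsanitized user input is concatenated directly into an XPath query.
class XPathQueryDemo {
    static String buildQuery(String employeeID, String password) {
        // Anti-pattern: no sanitization of employeeID or password.
        return "//Employee[UserName/text()='" + employeeID
             + "' and Password/text()='" + password + "']";
    }
}
```

Calling buildQuery with the input ' or '1'='1 in both fields yields a query whose predicate is true for every Employee node, which is exactly the attack described above.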

A programming guideline to avoid XPath injection is to sanitize the user input, as for other kinds of injection attacks. As explained earlier, checking whether this guideline is followed can be reformulated as a taint analysis. This example is actually simpler than the one described in the introduction, as there is only one way to sanitize input rather than several depending on the context.

4.2 Soot Implementation

We explain the first part of implementation, namely an intraprocedural forward flow analysis for which Soot provides a framework. To analyze a method body, we use Soot’s Jimple intermediate representation.

In a Jimple body, we call a variable tainted (i) if it is a string variable whose value is obtained through user input (e.g. via the getParameter() method from the previous subsection) or (ii) if it is a concatenation of two strings of which one or both are tainted. If we restrict ourselves to intraprocedural analysis, we must also deem tainted all return values of method calls as well as the contents of fields (object variables). Below, we describe an interprocedural extension that lifts the first restriction; dealing with the second one is left for the future.

We use the standard Java class Boolean to represent the values tainted and untainted, i.e., true for tainted and false for untainted. Each (local) variable in a Jimple body is mapped to a boolean taintedness value. To model the mapping between variables and boolean values, we define a map as \({\texttt {TreeMap<Value, Boolean>}}\) and implement flowsets with a new class AbstractState that includes the map as a field. The reason that the key set is Value rather than Local stems from the fact that Value is the least common superclass of Local (local variables) and IdentityRef (formal parameters). Our abstract state also contains a boolean field broken indicating whether or not the guideline has been violated up to this point; this would be the case if we applied the framework method compile to a tainted argument.

figure g
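Since the original listing is not reproduced here, the following is a self-contained sketch of such an abstract state; Soot's Value keys are replaced by plain variable names so that the example runs outside Soot, and all method names are ours:

```java
import java.util.TreeMap;

// Sketch of the abstract state described above: a map from variables
// to taintedness, plus a "broken" flag recording whether the guideline
// has already been violated.
class AbstractStateSketch {
    final TreeMap<String, Boolean> taint = new TreeMap<>();
    boolean broken = false;

    void set(String var, boolean tainted) { taint.put(var, tainted); }

    boolean isTainted(String var) { return taint.getOrDefault(var, false); }

    // Merge at control-flow join points: pointwise disjunction of
    // taintedness, disjunction of broken-ness.
    static AbstractStateSketch merge(AbstractStateSketch a, AbstractStateSketch b) {
        AbstractStateSketch out = new AbstractStateSketch();
        out.taint.putAll(a.taint);
        b.taint.forEach((v, t) -> out.taint.merge(v, t, Boolean::logicalOr));
        out.broken = a.broken || b.broken;
        return out;
    }
}
```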

Then, we instantiate \({\texttt {ForwardFlowAnalysis<N,A>}}\) in Soot as ForwardFlowAnalysis \(<\) Unit, AbstractState \(>\). The Soot framework requires the subclass of the flow analysis class to implement a constructor and the methods copy(), merge(), newInitialFlow(), entryInitialFlow() and flowThrough().

Merging is implemented using union and disjunction. The analysis starts with the assumption that all variables are untainted and that we are not broken yet. The method flowThrough() describes how the abstract state is changed upon the execution of basic statements, i.e. assignments and method calls.
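As an illustration (this is our own sketch, not the original flowThrough() implementation), the transfer rules for the three statement forms that matter for this guideline can be written over a plain map from variable names to taintedness:

```java
import java.util.Map;

// Sketch of the flowThrough() logic for three statement forms:
// reading user input, string concatenation, and calling compile.
class TransferSketch {
    // x = getParameter(...): user input is always tainted.
    static void assignInput(Map<String, Boolean> s, String x) {
        s.put(x, true);
    }

    // x = y + z: the concatenation is tainted iff y or z is.
    static void assignConcat(Map<String, Boolean> s, String x, String y, String z) {
        s.put(x, s.getOrDefault(y, false) || s.getOrDefault(z, false));
    }

    // compile(x): the guideline is broken iff the argument is tainted.
    static boolean compileBreaks(Map<String, Boolean> s, String x) {
        return s.getOrDefault(x, false);
    }
}
```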

figure h

At the end of the execution of this block the “values” r1 and r2 will be untainted, whereas r3 and r6 will be tainted. The field broken will be false, however.

Interprocedural Analysis. The local variable r6 in the example above was deemed tainted just because we did not track taintedness across method boundaries. In order to achieve that in a meaningful way we need to track the dependency of taintedness of locals (and also broken-ness) on the taintedness of parameters.

We therefore use a more refined abstract state in which the Booleans are replaced by elements of the lattice \(L=\mathcal {P}(P)\cup \{\top \}\), where P is the set of formal parameters, including this (IdentityRef in Soot). A lattice element \(\{x,y,z\}\) represents a value that is tainted if at least one of x, y, z is tainted, and definitely untainted if none of them is.

figure i

In this case, the initial abstract state will set each formal parameter x to \(\{x\}\) whereas the other locals are set to \(\emptyset \).

figure j

In this example, the abstract state is initially \((\{ x \mapsto \{ x\}, y \mapsto \{ y\}, z \mapsto \emptyset , t \mapsto \emptyset \},\emptyset )\). The resulting abstract state is expected to be \((\{ x \mapsto \{ x\}, y \mapsto \{ y\}, z \mapsto \{ x, y\}, t \mapsto \emptyset \},\{x\})\) signifying in particular that z is tainted if either of the arguments is and running the body will violate the guideline if x is tainted.
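The join operation on the lattice \(L\) can be sketched as follows (encoding the top element \(\top\) as null is our own choice for this illustration): joining two parameter sets is set union, and \(\top\) absorbs everything.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the lattice L = P(P) ∪ {TOP}: null encodes TOP ("tainted
// no matter what"), and a set of parameter names encodes "tainted iff
// at least one of these parameters is tainted".
class TaintLattice {
    static Set<String> join(Set<String> a, Set<String> b) {
        if (a == null || b == null) return null; // TOP absorbs
        Set<String> r = new TreeSet<>(a);
        r.addAll(b);                             // set union
        return r;
    }
}
```

For instance, joining \(\{x\}\) and \(\{y\}\) yields \(\{x,y\}\), matching the element computed for z in the example above.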

We then obtain the desired interprocedural analysis by making iterated calls to the intraprocedural analysis class. First, we set up a table: a treemap that maps each method to a summary, where a summary comprises two lattice elements, one describing the taintedness of the return value of the method and the other its broken-ness, i.e., under what conditions on the taintedness of the formal parameters running the method might violate the guideline.

figure k

The summaries for the built-in methods append, compile and getParameter are as follows:

$$\begin{aligned} \begin{array}{l|ll} &{} \texttt {ret} &{} \texttt {broken} \\ \hline \texttt {append} &{} \{\texttt {this},p_0\} &{} \emptyset \\ \texttt {compile} &{} \emptyset &{} \{p_0\} \\ \texttt {getParameter} &{} \top &{} \emptyset \end{array} \end{aligned}$$

Thus, \(\texttt {getParameter}\) in itself never violates the guideline (\(\texttt {broken}=\emptyset \)), but its return value is always tainted (\(\texttt {ret}=\top \)). Similarly, running \(\texttt {append}\) does not per se violate the guideline, but its result is tainted once \(\mathtt{this}\) or \(p_0\) is. Running \(\texttt {compile}\) on a tainted input violates the guideline (\(\texttt {broken}=\{p_0\}\); here \(p_0\) denotes the first (and only) formal parameter of the method); the return value of \(\texttt {compile}\) is never tainted.

Next, for each method body and its control flow graph we compute the abstract states at exit points by using getTail() and getFlowAfter().

figure l

Herein, cls refers to the entire set of classes comprising the application to be analysed. After the above code has been executed, we check whether the updated method table newtable is equal to mtable (in the extensional sense), in which case we have reached a fixpoint and mtable contains the results of our interprocedural analysis. Otherwise, we replace mtable with newtable and start over. It is easy to see by induction on the number of iterations that newtable is pointwise a superset of mtable, so that a fixpoint will be reached and the iteration is guaranteed to terminate.
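The iteration just described can be sketched as follows. To keep the sketch self-contained, each method body is abstracted as a function computing the method's new summary from the current table, and summaries are plain Booleans ("may break the guideline"); the real implementation instead re-runs the intraprocedural analysis over each body.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch of the summary-table fixpoint iteration:
// recompute every method's summary from the current table until the
// table no longer changes.
class FixpointSketch {
    static Map<String, Boolean> run(
            Map<String, Function<Map<String, Boolean>, Boolean>> methods) {
        Map<String, Boolean> mtable = new TreeMap<>();
        for (String m : methods.keySet()) mtable.put(m, false); // optimistic start
        while (true) {
            final Map<String, Boolean> current = mtable;
            Map<String, Boolean> newtable = new TreeMap<>();
            methods.forEach((m, f) -> newtable.put(m, f.apply(current)));
            if (newtable.equals(current)) return current; // fixpoint reached
            mtable = newtable;
        }
    }
}
```

With monotone per-method functions the table can only grow, so, as argued above, the loop terminates.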

We also remark that interprocedural analysis required us to implement flowsets (for intraprocedural analysis) with a tree-like structure, whereas it is common to implement flowsets with hash sets. This was due to strange interactions with hashing that occurred during interprocedural analysis.

Example 3. The current implementation successfully handles the motivating example above and also the following more artificial one which demonstrates that we support method polymorphism.

figure m

However, as already mentioned, we do not presently handle class fields. To do that properly, a region typing [12] or a points-to analysis would be necessary. The following example is therefore rejected, since we over-cautiously consider all contents of fields as potentially tainted.

Example 4

figure n

5 Conclusion

We have described our plans for a configurable, type-based analysis for the enforcement of programming guidelines. In contrast to most existing uses of formal methods in the context of secure programming, we aim at an easily configurable analysis that guarantees adherence to well-specified guidelines rather than the absence of vulnerabilities. While guidelines often implicitly form the basis of code analysis tools, here they are the primary goal of the analysis. Of course, the ultimate goal is to avoid attacks; we find, however, not least for reasons of ultimate responsibility, that it may be better to ensure automatically that expert advice has been followed and that it is crystal clear which piece of advice (guideline) has been followed. We therefore consider the precise formalisation of guidelines an important concept, whose feasibility of course remains to be demonstrated. Our approach focuses not only on finding bugs but also on the human factors of the analysis. One of our goals is to provide human-readable output that presents detailed information about how a guideline is violated in the source code. Our decision to implement on top of the Soot framework complies with this goal, since Soot has built-in visualization tools.

As a first step in this direction we have developed, within Soot, a concrete analysis enforcing adherence to a guideline against XPath injection proposed by the OWASP consortium. The next steps will consist of implementing the region-based type system from [12] within Soot, which will improve the context sensitivity of our analysis. On top of that, we can then develop further guidelines, including those described semi-formally in Sect. 3.