Keywords

1 Introduction

Denial-of-service (DoS) attacks based on algorithmic complexity have received increasing attention recently [26, 29].Footnote 1 Unlike classical DoS attacks based on flooding an application with network requests, or exploiting bugs that crash applications, algorithmic complexity-based DoS attacks target the exhaustion of computational resources such as CPU or memory, with small inputs that cause worst-case performance behaviour in a program [7]. Some of the more well-known attacks based on this class of vulnerabilities are regular expression DoS (ReDoS) [29] that target regular expression engines, and HashDos [7] that target hash functions.

A complexity-related DoS vulnerability can be due to programs recursively traversing composite data structures. Tree and graph-like data structures that are composed of parts are common in programs, and when such structures are defined recursively, it is practical to define recursive operations over them. Such an operation can potentially run in exponential time or/and space if redundant traversals are not (or cannot be) controlled. These performance-related vulnerabilities can be exploited to carry out DoS attacks on systems. Serialisation,Footnote 2 which externalises a program’s internal data structures to disk or for transmission over a network, and external format parsers that utilise these data structures present opportunities for DoS attacks based on these vulnerabilities [9].

Static and dynamic analysis techniques have been used for detecting performance-related bugs in programs. However, most existing approaches for detecting them are domain-specific - for instance, detection techniques for regular expression engines [33]. Fuzzing has been used to detect performance defects in programs. However, fuzzers are inefficient by nature with normal turnaround times in hours or more [16]. Constraint-based techniques such as symbolic execution are limited when it comes to producing complex inputs and there is little work on using them to detect performance bugs, especially for complex inputs [20, 23].

This work presents a characterisation of a class of vulnerabilities along with a novel approach to detect them. Our approach is based on modelling the vulnerabilities, implementing the analysis for the Java language and constructing payloads to verify the analysis reports. Finally, the constructed payloads are used to check if other libraries are vulnerable as well. We characterise some of the program structures that facilitate such an attack. This is broken down into three parts (we refer to these parts as the three T’s):

  1. 1.

    Topology: a data structure that has a topology which allows the redundant execution of recursive code/methods.

  2. 2.

    Traversal: the presence of recursive methods that operate on the elements in the data structures identified in step one.

  3. 3.

    Trigger: an execution path, in a program, from an entry point method for the program, typically a method that loads and evaluates data, to a recursive method identified in step two.

We implemented this analysis for Java and then evaluated it on a set of 16 Java parser libraries for different data formats. The scope and impact of this study goes beyond the vulnerabilities found in the Java libraries as these libraries are used in numerous applications. We validated the vulnerabilities by constructing malicious inputs, and it turns out that some of these reveal vulnerabilities in libraries and applications written in other languages such as Rust and PostScript.

In our study, we found a total of 11 vulnerabilities: Four new vulnerabilities in Java libraries using the analysis, (i.e. Apache PDFBox, PDFxStream, Apache Batik and SnakeYAML), and seven vulnerabilities in non-Java libraries found during the evaluation. All 11 issues were reported to the vendors (7 were accepted as security bugs) and four CVEs were obtained as a result. We have made the implementation and results publicly available for replication.Footnote 3

2 Motivation

The vulnerabilities that we study are closely related to the well-known billion laughs attack on XML parsers [6] where the payload consists of an XML document with nested XML entities where each entity’s definition contains references to the preceding definition. Parsing the document results in an output, which has a length exponential in the depth of the nesting, that causes the service to degrade or fail from memory exhaustion.

The basic idea of billion laughs can be ported to an attack on Java programs as shown by Coekaert’s SerialDoS vulnerability [5]. In SerialDoS, shown in Listing 1.1, the equivalent of the nested entity references in an XML element is a serialised collection of nested sets.

figure a

The problem occurs if root.hashCode() is invoked. The hashCode of HashSet is recursive, i.e., it is computed using the hash values of the elements of the respective sets. This leads to a computation that is exponential in depthN.

The first enabling property for this attack is the shape of the data structure, forming a network of cross-referencing parent-child relationships, where each child node is referenced by more than one (in this case: two) parent objects. The resulting object graph for the listing is shown in Fig. 1. The second ingredient is the recursive method that operates on this structure, hashCode() in this case.

Finally, there must be a way to parse the format to an internal representation with the same topology and trigger the traversal over it. The fact that the method can be reached from the entry point of a program can be statically determined by examining direct paths from the entry point method to the target method in the call graph.

In the case of SerialDoS, the trigger is the deserialisation API. Serialisation is a common attack surface for Java applications as demonstrated by Frohoff et al. [2, 10]. The deserialisation of the HashSet instances encountered in the stream will then invoke hashCode() via a call graph chain from readObject(), and the malicious computation is activated.

Redundant traversals can be solved using dynamic programming in some cases. However, a solution is not available in the case of SerialDoS where it is not a programming defect that gives rise to the vulnerability. In SerialDoS, each method invocation happens within another context (i.e., state of the stack), and for a programmer to cache intermediate results would require knowledge about the state of the heap and stack during each invocation.

Fig. 1.
figure 1

Many-to-many topology in object graphs – each vertex represents an object and an edge represents a reference. Note that parents have two children, but children also have two parents. The graph is abstracted in the sense that intermediate references to elements in java.util.HashSet are not shown.

3 Characteristics of the Vulnerability

In this section, we discuss the three defining characteristics, TTT, needed to craft DoS attacks based on vulnerabilities in code recursively traversing data structures: topologies, traversals and triggers.

3.1 Topologies

In Java-like languages, an instance of a data structure forms a graph of objects as depicted in Fig. 1. An object, parent, references \(child_1\) if \(child_1\) is the value of a field of parent. Often, references are indirect via intermediate objects, in particular arrays or collections. Given an object graph, we are particularly interested in subgraphs formed by objects of some type T, where these objects have more than one predecessor and/or successor of type T. These structures can be described as following a many-to-many pattern.Footnote 4 Examples of this are common in the Java collection library, where, for instance, lists can be elements of multiple other lists.

A pattern that is very widely used, and conceptually similar to many-to-many, is composite [11]. A composite has containers with elements that are either containers as well, or leaf nodes. Usually, there are dedicated container and child types subtyping a more abstract component type, but variations of this pattern are common.Footnote 5

Whether a composite can form a many-to-many graph depends on how the parent-child relationships are represented and controlled. Often it is an implicit precondition for an operation with a composite as an input, that at most, the composite has a single reference to any of its children. This is often enforced by an explicit parent reference in the children, and an API that maintains the consistency of the parent-child and child-parent references.Footnote 6

3.2 Traversals

A many-to-many pattern in an object graph describes the data structure that can be exploited in SerialDoS-style attacks. To actually launch an attack, a function that operates recursively on this structure must be present. Regarding these functions, or methods as they are called in object-oriented languages, there are two aspects to consider. Firstly, recursion can be simply described by the presence of strongly-connected components in the call graph.

Secondly, the recursive method needs to traverse the many-to-many object graph. Consider such a method m with formal parameters \(m.this,m.arg_1,m.arg_n\). Then, there must be a call graph chain linking m to itself. Additionally, at this invocation there is a parameter that points to a child of an object that the respective parameter points to at the previous invocation.

This captures several common traversal patterns found in real world programs. The most obvious one is direct recursion, where the many-to-many elements are always referenced by this.

There are more complicated traversal patterns though. For instance, it is possible to implement traversals using static methods in some class, where the traversed data structure is only passed as a parameter. This can be used to program traversals implemented outside the data structure being traversed. A more standardised way to achieve this is using the visitor pattern [11], which factors out the operation to be performed on a data structure from its implementation. Note that the component of traversals we have discussed so far concerns traversing the graph depth-wise. Another aspect is how a method traverses the children of each parent. In our example, this is accomplished through the for-loop which iterates over the children in elements. A generalised technique is to use the iterator pattern [11], which abstracts the traversal of different implementations of collection types.

3.3 Triggers

The presence of the topology and the traversals itself is not sufficient to launch an attack. It is also necessary to create objects that instantiate the many-to-many pattern, and to trigger the traversal. This requires an external data structure to be translated to a vulnerable internal representation of the object graph in a data format, interpreted or otherwise processed, and in this process the traversal is activated. The activation refers to an invocation chain from a method that is called by the library or the framework when the data is processed, to the actual traversal method (e.g., deserialisation in SerialDoS). In general, a trigger is a method in the public API of a library that is invoked when data is processed (either by a client program, or by the library itself), and that leads to the invocation of the traversal (method).

4 Modelling the Analysis

In this section, we describe how we model the topology, traversal and trigger pattern for our analysis. We formalise the analysis in graph-theoretic terms using three graphs that model different aspects of program behaviour. These definitions are language-agnostic, and are applicable to languages with common object-oriented language features. We occasionally refer to Java for illustrative purposes.

4.1 Preliminaries

The Type Graph. A type graph models classes and their relationships in a program written in an object-oriented programming language, formally defined as follows:

Definition 1

The type graph of a program is a directed labelled graph \(\langle T, E \rangle \) where

  • T is a set of vertices representing classes and (nested) arrays of classes that occur in the program, we represent vertices using their class name, with the suffix \([ \, ]\) for array types as usual. For the sake of brevity, we identify those names with the respective vertices from now on.

  • \(E \subseteq T \times T\) are the graph edges. We consider edges of two kinds, i.e., \(E = E_{subtype} \cup \; E_{assoc}\), representing subtype and association relationships, respectively.

Subtype edges between vertices represent subtype relationships. The particular rules differ between languages, for Java, they are defined in [13, Sect. 4.10]. We use the simplified notion “B is subtype of A” to say that there is a path consisting of subtype edges linking B to A. That is, the edge \((B \rightarrow A) \in E_{subtype}\). If \(C \in T\) is either representing an array, or a class that is a subtype of a container type (such as Java’s java.util.Collection), then we call C a container type. If a container type is an array A[] then A is called the component type of the container. If the language supports the declaration of component types for container types, using some mechanism like Java’s generics [13, Sect. 8.1.2], , then A is called the component type of C. Association edges represent the relationships between container types and their respective component types.

The Points-To Graph. A points-to graph models the memory during program executions. There are several versions of points-to analysis, our representation is used in an Andersen-style, field-sensitive analysis [30]. In a points-to graph, object abstractions and variables are represented by vertices, and allocations, assignments, field or array stores and loads are represented by edges which define the flow of values in the program being represented.

Definition 2

The points-to graph of a program is a directed labelled bipartite graph \(\langle O,V, E_{alloc},E_{assign},E_{load},E_{store} \rangle \) where

  • V is the set of vertices representing the variables in the program

  • O is the set vertices representing object allocation sites in the program

  • \(E_{alloc}\subseteq O \times V\) is a set of allocation edges, modelling memory allocation

  • \(E_{assign} \subseteq V \times V\), a set of assignment edges modelling variable assignment

  • \(E_{load} \subseteq V \times V\), a set of field load edges modelling field loads, labelled with the respective field name

  • \(E_{store} \subseteq V \times V\), a set of field store edges modelling field stores, labelled with the respective field name

Array access can be modelled similarly to field access. We omit details for the sake of brevity here.

Given a points-to graph, the objective of a points-to analysis is to infer additional flowsTo edges \(E_{flowsTo} \subseteq O \times V\) describing the relationship between abstract values and variables pointing to them.Footnote 7 This is a computational complex problem, usually solved by computing CFL-reachability via a fixpoint algorithm [8, 28].

One of the main uses of the points-to graph is alias analysis. Two variable vertices \(v_1,v_2\) alias if there is an object vertex \(o \in O\) and paths consisting of flowsTo edges from o to both variable vertices \(v_1\) and \(v_2\). Aliasing means that both variables can point to the same memory location. Furthermore, this can be used to define a heap access path between variables. A heap access path between \(v_1\) and \(v_2\) consists of a sequence of load edges where the destination (sink) of an edge in the sequence aliases with the source of the next edge, \(v_1\) is the source of the first and \(v_2\) the destination of the last edge in the sequence. This models (nested) field access in a programming language, i.e. statements like foo.f.g. By accounting for aliases, this also covers field access with intermediate variables, in programs like x = foo.f; x.g;.

The Call Graph. A call graph statically models the interprocedural calling (invocation) behaviour of a program. Methods are represented by vertices and interprocedural calls by edges.

Definition 3

The call graph of a program is a directed graph \(\langle M, I \rangle \) where

  • M is the set of methods in the program

  • \(I \subseteq M \times M\) is a set of edges, \((m, n) \in I\) means that method m has a call site with an invocation of n.

A call graph can be constructed by analysing invocation instructions found in program code. To model runtime behaviour in languages with dynamic dispatch, additional edges must be inferred (devirtualisation). There are various algorithms available for this purpose differing in their precision and efficiency (e.g. CHA, VTA) [30]. The more precise methods require points-to information to determine the type of the objects a receiver points to in a method invocation, a process often referred to as call graph construction on-the-fly.

Cross-Referencing Graphs. The three models defined above are widely used in static program analysis. In practice, they are often combined. There are certain relationships between these models we will exploit in our analysis.

Firstly, for a given method m, the variables in the points-to graph include the return value and the parameters of this method. We denote the parameters, including the receiver of an invocation (in Java, the this reference), as \(param(m) \subseteq V\). Secondly, call graph vertices can be associated with type graph vertices via the types that define those methods. We refer to those types as the owner of a method m, \(owner(m) \in T\). Finally, allocation vertices \(o \in O\) in the points-to graph can be associated with the types they instantiate, \(type(o) \in T\).

4.2 Analysis Specification

Topologies. Given a type graph \(\langle T, E \rangle \), we describe an instance of composite as a mapping between the two roles in the composite design pattern [11] and actual types that occur in the program.Footnote 8

Definition 4

Given a type graph \(TG = \langle T, E \rangle \), a composite is a mapping \(\{cont, comp\} \rightarrow TG\) such that the following two conditions are satisfied:

  • (\(composite(cont),composite(comp)) \in E_{assoc}\)

  • (\(composite(cont),composite(comp)) \in E_{subtype}\)

Traversals. We define a traversal as the presence of a recursive invocation in the call graph and the flow of an object from a field of a composite to an argument of the recursive call.

Definition 5

Given the type graph \(TG = \langle T, E\rangle \), the call graph \(CG = \langle M, I\rangle \) and the points-to graph \(PG= \langle V,O, E_{alloc},E_{assign},E_{load},E_{store} \rangle \), a traversal is a method \(m \in CG\) such that:

  • \(m \in M\) is recursive in CG, i.e. CG contains a path connecting m to itself

  • there is a composite c and \(c(comp) \in param(m)\)

  • there is a heap access path in PG from some parameter of \(v \in param(m)\) to the respective argument in the recursive call site for m

Triggers. We define a trigger as the presence of a method that instantiates a composite through a chain of method invocations. Once the composite is instantiated it would also trigger the recursive traversal with the composite as an argument.

Definition 6

Given a call graph \(CG = \langle M, I\rangle \) and the points-to graph \(PG= \langle V,O, E_{alloc},E_{assign},E_{load},E_{store} \rangle \), a trigger is a method \(trigger \in M\) in the call graph that is reachable from a program entry point \(ep \in M\). There must also be a path in the CG from the trigger to a traversal method with the instantiated composite as an argument in PG.

5 Experimental Setup and Evaluation

5.1 Approach

The approach focuses on first running the static analysis to detect instances of the TTT pattern in Java parser libraries, then using the analysis results to confirm whether we can construct payloads to exploit vulnerabilities, and finally using the payload for non-Java parsers (e.g., C, Python, Rust etc.) to check if these parsers are also vulnerable. The payloads were constructed manually using format specifications, source code indicated by analysis results. The effort required to construct payloads varied across different libraries, and this process is discussed in more detail in Sect. 5.5.

5.2 Implementation

The analysis was implemented as an extension of Doop [1], a state-of-the-art static analysis framework for Java, which encodes static analyses declaratively in the Datalog [4] language. Datalog programs are a natural way to express the graph-based algorithms [35] used in the specification of our analysis. Datalog-based formulations of static analyses have been used successfully in bug and vulnerability detection [14, 18, 27]. The underlying analyses in Doop compute points-to information and call graphs from an input program, and they can also be used to obtain the type graph.

5.3 Libraries for Analysis

We evaluated the analysis on a set of 16 widely used Java parser libraries for different data formats (Table 1). These popular libraries are known to parse external data formats, and are therefore prone to the vulnerabilities we study. These libraries process data used in messaging, object serialisation and document representation, represented in various text and binary formats. We covered libraries for parsing or processing XML, JSON and YAML, PDF and external DSLs. Table 1 also contains usage data showing how many Maven artefacts depend on those libraries. This provides some indication of the impact vulnerabilities in these libraries have based on usage statistics.Footnote 9

Table 1. Java libraries for analysis

After analysing these 16 libraries, we proceeded to evaluate whether libraries for other languages, shown in Table 2 were vulnerable using the payloads constructed. These libraries were selected based on availability and popularity. librsvg is used in the GNOME desktop,Footnote 10 Ghostscript is widely deployed and used for processing PDFs, a popular vector-based illustration software - Inkscape uses librsvg and cairo.Footnote 11 InkscapeFootnote 12 is also used by ImageMagick for SVG processing. We looked at solutions for sanitising SVG files, and found that svg-sanitizer is the most widely used (e.g. WordPress, drupal).

Table 2. Non-Java libraries investigated

5.4 Triggers or Entry Points

Identifying a trigger is a manual step that requires domain knowledge of the library under analysis. For image formats this could be the rasterisation/conversion process that would require traversals of the structure. Some libraries have command line interfaces which initiate calls to the trigger methods. In the case where we analysed libraries without command line interfaces, a driver was required as an entry point for the input program for the analysis. We have written custom drivers for libraries that are not bundled with a command line interface. The driver provides an entry point as well as a facility to interact with the library’s API. In the case of SnakeYAML, the driver consists of statements to instantiate the parser and load a file. Only MVEL2, PDFBox, Batik and PDFxStream come with built-in command line interfaces and did not require custom drivers.

5.5 Evaluation

Static Analysis. The experiments were performed on an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz with 64 GB of RAM on Linux Ubuntu 18.04.3. Doop was run using the Java 8 platform as implemented in Oracle’s version 8 of the JDK. We used a context-insensitive analysis. the following options for the analysis:

  • analysis: -a context-insensitive

  • main class: -main option was used with the driver class as an argument.

For each project, we extracted facts from the input library, and then executed Doop with custom rules to compute:

  • Composites (i.e., facts instantiating the Composite rule).

  • Recursive methods.

  • Heap flows to refine the list of recursive methods, i.e. an object that is of composite type must flow to the parameter of a recursive method.

  • Methods with heap flows from the previous step, and that are reachable via the entry point.

Manual Evaluation of Analysis Results. At the end of the static analysis we find a set of candidate instances, i.e. bindings of the concepts used in TTT to concrete artefacts within the program under analysis. However, these may contain false positives as the malicious computation is effectively prevented by some program logic. While we cannot accurately eliminate false positives, we conducted a manual step to identify true positives, by constructing payloads that expose the respective vulnerability. This consisted of inspection of the program’s source code, debugging and reviewing specifications against the implementation for the particular parser library.

Table 3. Overview of experiments (composites and direct recursion)

6 Results and Discussion

Table 3 shows a summary of the outcomes of the experiment, including the static analysis run times for the 16 libraries. Methods and composites from the library and their dependencies are shown. The Topology column lists all composite types in the library. The Methods (Composite) column lists only those methods that have a composite type as a parameter. From these methods, the Traversal column lists methods that have a value flow within the composite field to the recursive callsite. The Triggered Traversals column lists reachable traversals for the identified topologies. In the following sections, we discuss some of the vulnerabilities detected in more detail.

6.1 PDF Vulnerabilities

The analysis detected 11 triggered traversals that recurse on a parameter in the PDFBox library. From these results, a vulnerability was confirmed in Apache PDFBox, the most used Java PDF library in the Maven repository.Footnote 13

A PDF document’s format, Carousel Object Structure (COS), is described in the PDF Reference [24]. It supports basic types such as booleans, integers, real numbers, strings, names and more crucially, arrays and dictionaries. The particular composite topology in the library consists of COSDictionary as the container and COSBase as the component where the children are stored in an object that implements the Map interface. The recursive method that traverses this structure is checkPagesDictionary(COSDictionary pagesDict) defined in org.apache.pdfbox.pdfparser.COSParser, which is invoked when the PDF file is parsed. The only constraint in the path condition from the entry into the method to the recursive call is the presence of child objects that are of the same type as the passed parameter.

Manual inspection of the source code and the PDF specification [24] revealed that the root of a PDF document, the catalog, points to a dictionary referred to as a Page Tree, which can in turn refer to another Page Tree. This structure parses to a COSDictionary composite and we can craft a PDF document that parses into an object graph with the many-to-many pattern.

Passing the crafted PDF to the application revealed that it can result in attacks on responsiveness and disk space as the application can also be used to convert the pages in a PDF to disk as images. The issue was reported and it has been accepted with the identifier CVE-2018-11797.

The same PDF document was used to confirm the vulnerability in PDFxStream (CVE-2019-17063), and it also revealed the same vulnerability in PDFtkFootnote 14, the PDF toolkit. They both use a HashMap to store the COS structure, which in principle makes a DoS attack possible.

We also tested the PDF document on Ghostscript [12], a PostScript and PDF interpreter. Using the crafted PDF as an input to Ghostscript resulted in DoS. This bug was accepted as a security vulnerability (CVE-2018-19478). The PDF parser in Ghostscript is implemented in PostScript and the traversal of the COS was for an entirely different purpose when compared to the previous cases, which was to detect cycles in the Page Tree that had caused a security vulnerability in Ghostscript. This suggests that traversals of the form that we have studied occur across multiple languages.

6.2 Scalable Vector Graphics (SVG) Vulnerability

SVG [31] is an XML-based format for two-dimensional graphics supported by web browsers and is used in illustration programs. SVG is processed by the Batik library. The analysis reported 34 triggered traversals for the library. One particular composite topology in the library consists of Node as the container with children or parents of the same type. The recursive method that traverses this structure is String getCascadedXMLBase (Node node) defined in org.apache.batik.anim.dom.SVGOMElement, which is invoked when the SVG file is parsed. The only constraint in the path condition from the entry into the method to the recursive call is the presence of child objects that are of the same node type as the passed parameter (i.e. XML element nodes).

Any SVG graphics element is potentially a template object that can be re-used (i.e., instanced) in the SVG document via a element. The element references another element and indicates that the graphical contents of that element is included/drawn at that given point in the document. The element can be used to specify a grouped container of elements. The element in conjunction with the element can be used to construct a nested structure to trigger the detected vulnerability. The element can also be used to construct the SVG version for SerialDoS as shown in Listing 1.2. We based our SVG file on this ability to nest references using and elements. There is an additional way to reference elements, as shown in Listing 1.3, which uses the pattern tag and its fill attribute set to a url function containing the reference id for an element in the document.

figure k
figure l

The same SVG document was used to verify vulnerabilities in web browsers, and a core Linux SVG rendering library (librsvgFootnote 15). The issue was reported for librsvg and fixed by the vendor (CVE-2019-20446). We found the crafted file to impact all tested browsers (e.g. Mozilla Firefox (version 73.0), Google Chrome (Version 77.0.3865.120, Official Build) by excessively consuming resources (memory and CPU) for Firefox and crashing the active browser tab in Chrome. This can be used by malicious parties to craft client-DoS for websites that allow links to SVG code in user input (e.g. Markdown with links to external SVG files in user comments). We confirmed this observation for the StackOverflowFootnote 16 Q&A platform, GitLabFootnote 17 and GitHubFootnote 18 issue trackers. The impact on these services is that they can render the page inaccessible to users if it has malicious SVG content.

We also considered svg-sanitizerFootnote 19, which performs server-side sanitisation of SVG content (used in Drupal and WordPress as a plugin). On passing the crafted SVG file as input, svg-sanitizer, entered a non-terminating computation which make services using the plugin susceptible to DoS attacks. This issue was reported to the developer and it was fixed by adding a check to limit levels of use recursion during SVG sanitisation.

6.3 YAML Vulnerability

YAML is a popular and widely used (human readable) serialisation language for data interchange and application configuration. It supports primitives and common data structures such as maps and lists [34]. We looked at the SnakeYAML library in Java and the analysis reported three triggered traversals, one of which involves the composite MappingNode with a list of NodeTuple as children that can potentially have the same type as the parent.

The code is the implementation of the merge key feature in YAML, which is used to indicate that all the keys of one or more specified maps should be inserted into the current map. If the value associated with the key is a single map, each of its key/value pairs is inserted into the current map, unless the key already exists in it. If the value associated with the merge key is a sequence, then this sequence is expected to contain multiple maps and each of these are merged in order. Listing 1.4 shows the use of YAML merge as well as the use of YAML aliases in constructing SerialDoS type inputs, which were detected as vulnerabilities by the analysis.

figure n

We created a YAML file with nested merges and nested lists with aliases forming the topology, and passed the file as input to our SnakeYAML driver to confirm that it crashed from stack exhaustion for the nested merges case, and entered a long-running computation for nested lists. Consequently, this issue has been reported to the maintainer.

6.4 Newly Discovered Security Vulnerabilities

Following the guidelines for responsible disclosure, we have reported all vulnerabilities to the libraries’ developers/maintainers. We provide a timeline and the statuses of each of these vulnerabilities below (Table 4).

Table 4. Status of reported bugs and vulnerabilities status.

6.5 Threats to Validity

Manual confirmation of the vulnerabilities reported by the tool poses a threat to the validity of the evaluation. However, for the most likely candidates, we were able to construct payloads and confirm that the reports are actual bugs. Even though a hand-selected set of projects was used in the evaluation, the generality of the model and the discovery of related bugs in other libraries are encouraging.

7 Related Work

7.1 Detecting Algorithmic Complexity Vulnerabilities

Wuestholz et al. [33] discuss an approach to statically detect DoS vulnerabilities in programs that use regular expressions. The analysis has multiple stages and is conceptually similar to the analysis proposed in this paper: they first build a model to detect vulnerable structures (by reasoning about the worst-case complexity of NFAs), and then devise a separate (taint) analysis to establish whether a vulnerable regular expression can be matched against an input string. The tool they have developed, Rexploiter, finds 41 exploitable security vulnerabilities in Java web applications. Holland et al. [15] propose a hybrid approach to detect algorithmic complexity vulnerabilities. In a static pre analysis step, they use a loop call graph to detect nested loop structures that are susceptible to algorithmic complexity vulnerabilities. The first step is similar to our approach, but uses a different model. Our approach is based on the presence of higher level data structures and recursive methods which then implicitly create the nested traversals.

7.2 Traversals/Performance Bugs

The detection of performance bugs and in particular redundant traversal is a problem related to DoS vulnerabilities. Olivo et al. [21] study redundant traversal performance bugs in Java code, limited to traversals in non-recursive functions, and a static analysis, Clarity, to detect them. Burnim et al. [3] present WISE, automated test generation for detecting worst-case complexity in programs. WISE uses symbolic test generation. Jiayi et al. [32] describe Singularity, another input generation technique for detecting worst-case performance bugs in Java programs. Singularity uses a greybox fuzzing technique that looks for critical input patterns modelled as recurrent computation graphs (RCGs). Their technique reveals performance and DoS-related bugs in real world programs. Other fuzzing approaches include SlowFuzz and PerfFuzz [17, 25]. Nistpor et al. [19] propose Toddler, an example of dynamic analysis to detect performance bugs. Toddler instruments loops and read instructions, and uses the data collected using inserted code to detect similar memory-access patterns. Padhye and Sen [22] describe Travioli, a dynamic analysis technique for detecting data-structure traversals. It is also based on instrumenting code in order to harvest trace data, from which the analysis model is built. The purpose is similar to what we try to achieve in this paper with the topologies and traversal steps of our model, however this being a dynamic model, it has different tradeoffs between precision and recall, and its quality depends on the existence of drivers (such as unit tests) that exercise a large part of the program under analysis.

8 Conclusion

We presented an approach to classify and detect a class of DoS vulnerabilities in parsing data structures. We evaluated this approach on a set of 16 Java parser libraries with a Datalog-based formulation of a static analysis using the Doop analysis framework for Java. The study revealed four new vulnerabilities in widely used Java PDF, SVG and YAML libraries. A further evaluation also revealed seven more vulnerabilities in parser libraries for Rust, PHP, C++ and PostScript. Out of these reports, we have obtained four CVEs and reported a total of 11 security issues to vendors (7 of which have been accepted). The results confirm that a lightweight static analysis can be useful in uncovering vulnerabilities that belong to this class. Possible directions for future work include using micro-fuzzing to fuzz the recursive functions reported by the static analysis for more precise results, and using constraint-based approaches to more precisely identify the topology and traversal patterns reported by the analysis.