Representing LLVM-IR in a Code Property Graph

Küchler, Alexander; Banse, Christian

doi:10.1007/978-3-031-22390-7_21

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13640))

Included in the following conference series:

International Conference on Information Security

1023 Accesses
2 Citations

Abstract

In the past years, a number of static application security testing tools have been proposed which use so-called code property graphs (CPGs), a graph model which keeps rich information about the source code while enabling its user to write language-agnostic analyses. However, they suffer from several shortcomings. They work mostly on source code and exclude the analysis of third-party dependencies if they are only available as compiled binaries. Furthermore, they are limited in their analysis to whether an individual programming language is supported or not. While often support for well-established languages such as C/C++ or Java is included, languages that are still heavily evolving, such as Rust, are not considered because of the constant changes in the language design. To overcome these limitations, we extend an open source implementation of a code property graph to support LLVM-IR which can be used as output by many compilers and binary lifters. In this paper, we discuss how we address challenges that arise when mapping concepts of an intermediate representation to a CPG. At the same time, we optimize the resulting graph to be minimal and close to the representation of equivalent source code. Our case-study on detecting cryptographic misuse indicates that existing analyses can be reused and that the analysis time is comparable to operating on source code. This makes the approach suitable for a security analysis of large-scale projects.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Frama-C: A software analysis perspective

Article 09 January 2015

A Formal Semantics for P-Code

Equid—A Static Analysis Framework for Industrial Applications

1 Introduction

Despite huge efforts of researchers and industry put into identifying vulnerable software, many software systems still suffer from various security weaknesses. The concept of code property graphs (CPG) [1] has been introduced to simplify identifying vulnerabilities and bugs in the source code of programs. A CPG covers properties of abstract syntax trees (AST), control flow graphs and data flow graphs, among others, thus containing all information relevant for a security analysis. The CPG enables its user to identify vulnerabilities or bugs by performing reusable graph queries. This perk led to a widespread adaption and several implementations [1,2,3,4,5,6,7]. Even if the graphs mimic the source code with a minimal loss of information, the graph provides an abstraction of the actual code. This abstraction is suitable to support a language-agnostic analysis of software. Unfortunately, the implementations are limited with respect to the supported programming languages since each language requires a separate translation.

As compilers suffer from a similar problem, the use of intermediate representations (IR) has become popular. The IR abstracts from the programming language but, in many cases, still contains a significant amount of high-level information such as the types of variables which is lost in the compiled binary and can barely be recovered [8]. The lack of such information can worsen the analysis results. The Low Level Virtual Machine (LLVM) project proposes LLVM-IR [9], a very popular IR. Numerous compiler frontends exist to translate source code to LLVM-IR. E.g., clang^{Footnote 1} translates the languages C, C++ and Objective-C to LLVM-IR and has been extended by Apple to support Swift^{Footnote 2}. Other frontends exist to support a wide range of programming languages (e.g., Rust).

While LLVM-IR was designed for compilers, it is also frequently used as output format of binary lifters [10,11,12,13]. This way, a lifter supporting multiple architectures and types of binary files can avoid the need to implement the translation for different flavors of assembly code or application binary interfaces. As binary lifting has meanwhile become a stable technique [14], its output can be used to perform a security analysis of the binary without requiring the sources.

If source code is available, analyzing the differences between the code and the binary can identify the security implications of compiler optimizations [15]. E.g., side channel vulnerabilities introduced by compilers are still a major concern of developers of cryptographic libraries [16]. As LLVM-IR can be emitted at all stages of the compilation, it can already contain such vulnerabilities.

In this paper, we show how we overcome shortcomings of existing CPG tools by enabling the analysis of LLVM-IR in such a graph. This bridges the gap between the analysis of source code written in higher-level programming languages and the analysis of programs (or dependencies) that may exist only in binary form. When including support for LLVM-IR in a CPG, several challenges arise from the static single assignment (SSA) form, the exception handling routine, instructions missing in high-level programming languages and significantly different syntactic representations of some concepts in LLVM-IR and other languages.

Contrary to prior work, we do not require to run any LLVM passes beforehand, helping us to keep the graph smaller. At the same time, we aim to retrieve as much high-level information as possible and map the code to high-level concepts whenever possible. Rather than handling LLVM-IR-specific instructions, e.g., cmpxchg^{Footnote 3}, only as a generic function call, we translate the concepts into existing CPG node types that represent the behavior of a higher-level programming language. This allows us to re-use existing concepts in queries, such as if-statements or pointer referencing. Our case-study shows that our integration of LLVM-IR into a CPG enables us to identify misuses of cryptographic APIs, a common security weakness. It further shows that we support not only the analysis of programs whose source code is available, but also of lifted binaries, enabling us to identify weaknesses and vulnerabilities regardless of the source code’s availability. In summary, our contributions are as follows:

We present the first mapping of all LLVM-IR instructions to existing CPG nodes with full compatibility to the existing structure. This ensures that existing analyses are fully compatible with the representation.
We show how we can keep the size of the CPG minimal.
We are the first to include LLVM-IR’s exception handling routines in a CPG.
We extended the open source project cpg^{Footnote 4} to support our concepts.

2 Background

2.1 The Code Property Graph

The cpg project [7] enables a graph-based representation of source code of different programming languages. To date, the focus lies in Java and C/C++ but experimental support for Python, Go and TypeScript is also available. The goal of the project is to provide a language-agnostic representation of the source code. This enables a security expert to identify vulnerabilities or bugs. Furthermore, the cpg library comprises a way to store the graph in neo4j^{Footnote 5}, and makes the graph accessible via a command line interface. For some cases, the library can also evaluate the value which can be held by a node. All this allows a security expert to write custom queries either to the graph database or the in-memory representation of the CPG. The cpg library is designed in a way to allow reusing these queries among all supported programming languages. To fulfill this goal, the cpg library implements a thorough class hierarchy which accounts for various types of statements and expressions. The CPG encodes information such as the class hierarchy of the code under analysis, the control flow graph, and the call graph in a single graph. The current design mainly targets object-oriented programming languages. To deal with a possible lack of some code fragments or errors in the code, the library is resilient to incomplete, non-compilable and to a certain extent even incorrect code.

2.2 The LLVM Intermediate Representation

The Instructions. The LLVM-IR is used as IR of the LLVM project. Its main purpose lies in providing an abstraction of code to ease the optimization and analysis of the program in a language- and architecture-independent way. The LLVM-IR holds values in global variables (prefixed with @) and local variables (prefixed with %) both of which can be named or unnamed. The LLVM-IR follows the static single assignment (SSA) form. Hence, every variable can be written to exactly once. This limitation does not affect global variables as they are represented as memory locations and are accessed via store or load operations.

Overall, the LLVM-IR differentiates between 65 instructions. Of these, 13 are arithmetic operations, 6 are bitwise operations, and 13 instructions cast types. The remaining instructions call functions, handle exceptions, load from or store to memory, manipulate aggregated types or jump to other program locations. The instructions can be enhanced with metadata to note the calling convention, optimizations or desired properties of functions and parameters, among others.

Besides the basic instructions, LLVM-IR contains numerous so-called “intrinsics”. Those are functions which model certain standard library functionality, or model frequent actions which have to be represented differently on different architectures. Some intrinsics repeat or refine basic instructions, others insert functionality such as the automated memory management in Objective-C.

The LLVM-IR supports a simple type system and differentiates between a set of primitive types and aggregated types such as structs, arrays and vectors. Additionally, LLVM-IR has a type for labels (i.e., jump targets), metadata and a so-called token which is used by certain instructions to transport information. Overall, the type system resembles C rather than object-oriented programming languages. In fact, object-oriented concepts are handled by the respective language frontend in LLVM. The frontend translates the object-oriented properties to concepts such as VTables for overriding methods, and method name mangling to support overloaded functions. In the case of Objective-C, it uses the dynamic dispatching strategy. Other languages make use of similar concepts.

Accessing LLVM-IR. The LLVM project offers a C++ and a C API to parse LLVM-IR and LLVm bitcode files. As the CPG project is mainly implemented in Java, access to the API has to be provided via the Java Native Interface (JNI). We use the open source project javacpp-presets^{Footnote 6} which provides access to the C API via JNI. Unfortunately, the C API has a flat type hierarchy in its functions to access the LLVM-IR’s AST, thus making the parsing of instructions and the extraction of their elements more error-prone if not parsed correctly^{Footnote 7}. However, as our evaluation in Sect. 6 shows, our implementation works in a stable way.

3 Related Work

Code Property Graphs. Researchers and industry proposed multiple use cases and implementations of CPGs and analysis tools [1,2,3,4,5,6,7, 17, 18]. All of these tools differ in their support for programming languages.

Closest to our work is the tool llvm2cpg [19] which uses Joern [5] as graph representation. The respective CPG represents most instructions as function calls and does not try to infer any of the high-level information. Furthermore, it uses the reg2mem LLVM pass to address the \(\varphi \) instruction of LLVM-IR, which significantly increases the code base. This results in additional instructions present in the graph and thus slows down the analysis and makes it more error-prone.

liOS [18] constructs a CPG holding assembly instructions and the function bodies lifted to LLVM-IR to analyze iOS apps. The graph model cannot be used to represent source code. Furthermore, liOS does not specifically address LLVM-IR instructions since the analyses mainly operate on assembly code.

Plume [4] and Graft [3, 20] only support Java bytecode, a different low-level language. Plume builds the graph incrementally to analyze data flows and has been merged into Joern in a revised version. Graft follows a similar goal. Other tools [1, 5, 7, 17] analyze source code and differ in their level of abstractions and supported languages. Some tools extend CPGs for specific use cases, e.g., analyzing cloud apps [6] or finding vulnerabilities with deep learning [2].

Graph-Based Security Analysis. Various other works investigated in the usage of other graph-based representations of the source code to identify bugs or vulnerabilities [21,22,23] or similar code fragments [24, 25], traverse the graph [26] or improve the analysis [27]. These works aim to provide a rich basis for analyzing the graphs. Many of the proposed techniques operate on other graph structures (e.g. the AST). However, the CPG combines a multitude of information and includes the respective relations, thus making the required information available for the analysis. Hence, these approaches can still be applied to the CPG.

Static Analysis of Multiple Programming Languages. Other works target the analysis of multiple programming languages [28,29,30,31,32,33]. Some of the frameworks rely on language-agnostic ASTs [34, 35] or aim to provide a common pattern for the AST of multiple languages [36, 37]. However, ASTs are cannot be used to find all kinds of bugs as they do not contain the required information [1]. Teixeira et al. [38] even propose to translate source code to a custom language.

Furthermore, various intermediate representations (IRs) have been proposed either for compilers (e.g., LLVM [39], GIMPLE [40], HIR [41], or CIL [42]), or specifically targeting code analysis (e.g. VEX IR [43], jimple [44], BIL [45], REIL [46], ESIL [47], DBA [48, 49] or RASCAL [50]). Since the IRs are often tailored to a specific use case or language, they differ in the information available in the instructions and their abstractions. Many of the IRs are integrated in analysis toolchains whose analyses are often specific to a use case and cannot easily be ported to other tools. Therefore, integrating such IRs in an abstract analysis platform like the CPG can enable further generalized security analysis.

Numerous tools [51,52,53,54,55,56,57,58,59,60,61,62,63] can analyze multiple programming languages. However, they can often barely share the analyses between the languages. The CPG representation allows reusing analyses across languages.

4 Enabling the Analysis of LLVM-IR with Code Property Graphs

Existing graph-based approaches to security analysis mostly focus on the analysis of programs with available source code. In this section, we present our approach which also enables static security analysis of lifted binaries using CPGs. E.g., this allows us to detect the misuse of cryptographic functions, as shown in our evaluation (Sect. 6.1). In detail, we describe how we map different LLVM-IR instructions to CPG nodes while reusing only existing node types. We show how to represent LLVM-specific constructs similar to their equivalents in already supported programming languages. We show how to model 1) arithmetic and logical instructions, 2) access to aggregate types, 3) the \(\varphi \) instruction and 4) LLVM-IR’s exception handling routine with a minimal increase of nodes.

4.1 Basic Instructions

Many instructions are known from other programming languages. We can coarsely differentiate between arithmetic and logical operations, operations which enforce specific interpretations of types, and operations which are composed of numerous steps but are often performed atomically on the CPU. In this section, we explain how we include those respective instructions in the CPG.

Almost all programming languages have a common subset of instructions or operations. This includes arithmetic, bitwise and logic operations, or comparisons which we map to their representation in high-level languages (+, -, *, /, %, \(\hat{ }\) , &, |, , etc.). Other instructions like jumps, calls, return instructions are modeled with their representation in C code. For if- and switch/case-statements, the branches or cases contain a simple goto statement. Later, a CPG pass removes such indirections whenever possible to reduce the size of the graph.

For some instructions, LLVM-IR can enforce a specific interpretation of the types of the arguments. E.g., the instructions udiv, sdiv and fdiv represent a division and are mapped to the binary operator /. Yet, they interpret the values as unsigned (udiv), signed (sdiv) or floating point value (fdiv). In the CPG, we add typecasts to the arguments to enforce the correct interpretation. In the context of a security analysis, such information helps to determine the type and signedness of values for detecting numerical errors (e.g., integer overflows).

In addition, some comparators of floating point values check if a number is ordered or not (i.e., if it is NAN). We split these comparisons into a check if the number is ordered and the actual comparison. E.g., the comparators ult and olt compare two floating point values and are mapped to the operator. However, the ult comparison checks if a value a is unordered or less than value b and thus is modeled as the statement . Similarly, we model the olt comparison with .

Some of LLVM’s instructions like atomicrmw and cmpxchg are known from assembly code rather than high-level languages and perform multiple operations atomically. The cmpxchg instruction loads a value from memory and replaces it with an operand if the value equals another operand. In the CPG, we model this by a block of statements holding the comparison, an if statement and the assignment in the then-branch. We annotate the block to keep the information that all this is performed atomically. Similarly, we model atomicrmw as a block of statements performing a load, an optional comparison and if-statement and an assignment to a variable. By modeling these instructions with a representation similar to source code, we simplify subsequent analyses. In contrast, prior work [19] models these instructions as a call to custom functions.

4.2 Handling Aggregate Types

High-level languages provide syntactic means to access elements of complex types like arrays, structs or objects. In LLVM-IR, arrays and structs are still available and their values can be accessed by special instructions. Since operations on such types offer a frequent attack surface (e.g., for memory corruptions), an accurate modeling of aggregated types and handling the access to their elements is relevant for a security analysis.

For arrays which are represented as a vector, the instructions extractelement and insertelement provide access to the elements. Both instructions are represented as an ArraySubscriptionExpression in the CPG, one being the left-hand side of the assignment and one the right-hand side. For all other aggregate types, the instructions getelementptr, extractvalue, and insertvalue model the access to the element either by the index inside an array or by the position of a field inside a structure. The code %b = insertvalue i32, i8 %a, i8 7, 1 shows how the second element of the variable a is set to 7. We model the instruction as a copy of a to the variable b and an assignment of the value 7 to the accessed field_1. Figure 1 shows the resulting graph with the initialization of b on the bottom right, and the access to the field on the left.

The example uses an interesting concept of LLVM-IR: a so-called literal structure, a struct whose layout is defined in the instruction. For such structs, we generate a type which is identified by the types of its fields. Hence, all literal structs with the same list of fields are regarded as the same type. In our example, the struct is named literal_i32_i8 and has the fields field_0 of type i32 and field_1 of type i8. The top left of Fig. 1 shows the declaration of the type. While the instructions insertvalue or extractvalue read or write values from memory, it is sometimes desirable to retrieve a pointer to an element of a structure. For this case, the instruction getelementptr computes a memory address without accessing memory. Listing 1.1 illustrates the usage of this instruction on a named struct. Listing 1.2, in turn, shows the same code written as C. Figure 2a shows the definition of the named struct and the connections between the fields for the graph retrieved from LLVM-IR. The result is remarkably similar to the graph in Fig. 2b which represents the C code. This similarity lets us reuse existing analyses for the graphs retrieved from LLVM-IR and shows that the graphs are structurally identical. In fact, the relations between variables and fields could be better resolved which can lead to improved analysis results.

4.3 The \(\varphi \)-Instruction

The SSA form enforces that each variable is assigned exactly once in LLVM-IR. However, in some cases, it is required to assign a value multiple times. A frequent example is a loop counter which is set before executing the loop and is updated on each iteration. To allow such behavior without duplicating code and without storing the values in memory, the \(\varphi \)-instruction is used. It assigns the target variable one of the inputs based on the previously executed basic block (BB).

As most programming languages do not have such an instruction, there is no fitting node to represent this in the CPG. To address this issue, prior work [19] relied on the LLVM reg2mem pass^{Footnote 8} which translates the instruction to multiple load and store operations. However, this pass also transforms the access to other variables and thus significantly increases the size of the resulting CPG. As this reduces the scalability of subsequent analyses, we avoid this LLVM pass. We collect all \(\varphi \)-instructions during the translation. Finally, we parse the instructions to identify the predecessor BBs and add an assignment to the target variable at the end of the BB. To keep the CPG clean, we further insert a declaration of the variable at the beginning of the function containing the \(\varphi \)-instruction and all BBs^{Footnote 9}. This, however, breaks the SSA form. The snippet in Listing 1.3 contains the \(\varphi \)-instruction while Listing 1.4 shows the function’s model in the CPG.

4.4 Exception Handling

For a security analysis, a suitable representation of the exception handling is necessary to identify weaknesses which originate from a missing or incorrect handling of errors. Since LLVM-IR offers a rich system for exception handling, this information should also be included in the CPG. The CPG represents exception handling routines with try-catch statements. To make the LLVM-IR fit into this pattern, we need to identify which instructions form a try-block and which ones form a catch-block. Concerning the try-block, we represent the invoke instruction as a try-block surrounding a function call and a goto-statement.

For the catch-blocks, however, such a straightforward model is not possible. In LLVM, the catchswitch instruction selects a matching catchpad based on the signature of the catchpad-instruction of a basic block. The catchpad contains the code of the catch-block and is ended by a catchret instruction. However, the matching and signature cannot easily be transferred to a high-level name. Therefore, we model this construct as a catch-block which catches all exceptions and contains if-statements representing the signature matching. If none of them matches, the exception is thrown again. The remaining constructs such as the cleanuppad and its cleanupret instruction are not modeled specifically.

Another way to mark a catch-block is the landingpad-instruction which, again, filters for the right object to catch. Once more, the matching is specific to the programming language and thus, modelling this is left to future work. If we cannot translate the instructions to concepts supported by the CPG, we model them as special functions similar to the LLVM intrinsics.

5 LLVM-Specific CPG Passes

The frontend parses and translates the code linearly with the exception of handling the \(\varphi \) statement. This, however, leads to a state where some of the nodes generated by the frontend serve as intermediate steps and can therefore be optimized after having processed the whole code base. In particular, as we will show, we inline unconditional jump targets and improve the generated catch-blocks. This clean-up phase takes place in a pass over the CPG nodes and serves to increase the stability and scalability of subsequent security analyses.

First, none of the conditional jumps (if-statements) and switch/case-statements incorporates a meaningful body of statements. Instead, they are modeled as goto statements to other basic blocks (BB). E.g., after the translation, an if-statement looks as follows: if(cond){goto BB1;}else{goto BB2;}. The pass identifies all BBs which have only a single predecessor and replaces the respective goto-statement with the BB. Note that we do not perform this transformation if multiple predecessors exist because it would increase the number of nodes.

Second, the pass removes the instructions which serve as intermediate steps during the generation of catch-blocks. Recall that a catch-block is initially modeled by catching all types of exceptions or objects, since, e.g. when translating the invoke instruction, we do not yet know which exceptions are handled inside the block. However, in the case of a landingpad instruction, we have a single object which is caught and the respective instruction is at the beginning of the generic catch-block. Therefore, we can refine the object which is caught and remove the intermediate step. For other exception handling techniques, in turn, we have empty throw statements which should propagate the exception caught by the generic catch block if none of the switch/case-statements match the actual object. Since we can now flatten the graph by removing unnecessary indirections (e.g., unconditional jumps as above, or simply wrappers around statements which had been introduced by the line-by-line translation), we can easily propagate the object caught by the generic catch-block to the throw statements.

As we explicitly aim to handle lifted or decompiled code, a second pass can remove method stubs, i.e., methods whose only purpose is to call a library method. The main purpose of this pass is to simplify subsequent analyses.

6 Experimental Evaluation

To reuse the same analyses for the graphs constructed from source code as well as the ones containing LLVM-IR, we carefully designed the translation in a way to mimic the concepts used in source code as closely as possible. In this section, we first show a case study which advocates that we can reuse queries that aim to identify security concerns in source code to query LLVM-IR. Second, we test the implementation against the Rust standard library to show the applicability of the approach to large-scale projects. All measurements were performed on a Ubuntu 20.04 running on an Intel i5-6200U CPU and 20 GB of RAM.

Table 1. Results for detecting misuse of cryptographic libraries.

Full size table

6.1 Case Study: Cryptographic Misuse

This case study is driven by the anticipated usages of the CPG on LLVM-IR. First, it should enable a security analysis of the LLVM-IR without the need to rewrite existing analyses. Second, it should be scalable by introducing a minimal number of nodes. The toolchain should be able to operate on LLVM-IR emitted during the compilation of a program (subsequently, we call this “compiled LLVM-IR”) or when lifting a binary (we call this “lifted LLVM-IR”). To show that these properties are fulfilled, we 1) compare the sizes of graphs retrieved from compilers and lifters, 2) compare the runtime of the analysis, and 3) show that the weakness can be identified with the same analysis in all samples.

We implemented a TLS-client in C++ which uses the openssl library. It accepts the insecure hashing algorithm MD5 as one of the options. First, we tested the toolchain against the original cpp file, which identified the respective issue. Next, we used XCode on macOS with the M1 chip and clang on Ubuntu running on a x64 CPU to emit the LLVM-IR which can be retrieved during compilation. As LLVM-IR also serves as target LLVM-IR for many lifters, we lifted binaries of the test file which had been compiled on the Mac and on Ubuntu with various compilers. We use RetDec [10] to lift the binaries to LLVM-IR and also decompiled them to a C-style file^{Footnote 10}. Table 1 summarizes the analysis time, how many nodes and functions are included in the graph and if the problem could be found successfully. We discuss the observations in the following paragraphs.

Size of the Graphs. One of our goals is to keep the sizes of the graph small. Therefore, we compare the size of the graphs retrieved from compiled and lifted LLVM-IR and when decompiling a binary file.

One observation is the significant increase in functions contained in the LLVM-IR compared to the original C file. This can be explained by stubs introduced by the compiler. Note, however, that RetDec seems to remove some of the functions which have been introduced during compilation. This reduction facilitates and speeds up a subsequent security analysis on the resulting graph.

Not only does RetDec reduce the number of functions contained in the binary but it also reduces the number of nodes compared to compiled LLVM-IR. This observation is in-line with recent research which found that some lifters, including RetDec, can reduce the complexity of the code represented by LLVM-IR as well as the number of elements it contains [14] while keeping the main functionality of the code available. The authors further observed that RetDec’s output is not suitable for recompiling in most cases. However, as the CPG library aims to handle incomplete, non-compilable and to a certain extent even incorrect code, this limitation should not affect the representation and further analysis.

Compared to the lifted LLVM-IR, the decompiled C files contain more functions but less nodes. This is explained by the possibility to summarize multiple LLVM-IR instructions in a single C statement. Overall, the reduction of nodes can be explained by RetDec’s passes which aim to eliminate unnecessary code.

Runtime of the Analysis. We ran the translation to the CPG and the bug detection query 100 times for each of the files and report the average runtimes in Table 1. First, it is interesting to note that the analysis time of the decompiled files is comparable to the one of the original cpp-file. The reduced number of nodes explains the speedup in some cases. The overall analysis time for the LLVM-IR files is ranging between 0.74 to 11.1 times the time of the original file. It is notable that the graphs retrieved from the LLVM-IR files contain 2.8 to 16.1 times the amount of nodes of the original file and still the runtime improved.

Identification of Weaknesses. To detect the misconfiguration in the test file, we implemented a query to identify the arguments of calls to the function SSL_CTX_set_cipher_list. To implement this analysis, we use the constant propagation implemented in the analysis module included in the CPG library^{Footnote 11}.

With the query, we are able to identify the flaw in the original C file and in the compiled and lifted LLVM-IR files. However, when decompiling the binary compiled on macOS using the M1 chip, we failed to identify the misuse. We manually investigated the case and found that the CDT library^{Footnote 12} which the CPG library uses for parsing the C file fails to identify the name of a field. Therefore, the data flow between the field and the method call is not resolved.

Stability of the Translation. All samples could be represented in the CPG without crashes. However, the LLVM-IR retrieved during compilation of a program contains a much richer semantics and uses various different instructions. This results in warnings, some of which show that nested instructions are not yet handled. The other ones indicate that a different scoping for variables in a try-catch block is expected because LLVM-IR’s scoping differs to other languages.

Table 2. Performance when analyzing Rust libraries.

Full size table

6.2 Application to the Rust Runtime

To assess the applicability to real-world programs, we retrieved the LLVM-IR from the standard and core libraries of Rust. We chose Rust since it is not yet supported by the CPG implementation and provides the option to compile to LLVM-IR. Overall, the test set includes 21 distinct LLVM files which are listed in Table 2 together with their size and the results. We report the time it took to translate the file (including various CPG passes) as well as the number of nodes which could not be parsed accurately. For the latter, we need to extend the LLVM-specific translation to include more cases of “nested” LLVM expressions.

Stability. We want to assess the maturity level of the translation step against a large and unknown codebase consisting of a total of 428, 777 lines of LLVM-IR. To measure this, the graph includes specific nodes, called ProblemNode, for each expression which could not be parsed correctly. While we handle all types of instructions, some arguments of the instructions can be computed in line by type casts, or simple arithmetic operations, among others. Overall, we could observe 7, 836 of such ProblemNodes, which accounts for \(0.60\%\) of all 1, 299, 234 nodes. This result is encouraging and indicates that the current implementation is already capable of handling the vast majority of all combinations of statements^{Footnote 13}.

The fraction of nodes which cannot be handled differs significantly among the samples and ranges between \(0\%\) to \(1.05\%\). Larger files are more likely to lead to an error during the translation. In addition, it is possible that the varying amount of complexity of the code could trigger more errors. To validate this, we set the average number of CPG nodes per line of code as complexity of the LLVM instructions. Among the samples, this ratio ranges between 1.22 and 3.67.

We plot this relation in Fig. 3. Neither of the graphs gives a strong indication for this idea since the error rates seem to be randomly distributed for all non-trivial samples. Neither the size nor the complexity of the samples lead to a conceptual limitation. Instead, some samples use unsoppurted expressions more frequently which can easily be addressed in the implementation.

Scalability. Another goal is to assess the scalability of the implementation on real-world software with many lines of code. Two factors can impact the analysis time: The lines of code and the number of nodes in the graph. According to Table 2, an increase of LoC leads to more nodes in the graph in most cases. Figure 4 plots the time of the analysis (i.e., the translation to the CPG and all CPG passes but the ControlFlowSensitiveDFGPass) for the number of nodes. With the exception of one sample, the analysis time seems to grow linearly depending on the number of nodes in the graph. Interestingly, when we only consider the analysis time of the LLVM-specific translation and pass of the CPG, the outlier is no longer present. This shows that the LLVM-related translation and CPG pass do scale well even for larger samples but that some of the other CPG passes seem to perform poorly in the presence of a specific combination of nodes.

Comparison to Prior Work. To compare our approach to prior work which relied on LLVM’s reg2mem pass to remove \(\varphi \) nodes, we ran our toolchain but first executed the respective pass. As Figs. 5 and 6 show, our approach leads to a significant reduction of nodes and time required to generate the graph.

7 Discussion

Our evaluation suggests that our translation and CPG model can unify source code and low-level representations such as LLVM-IR in a single graph representation. This increases the reusability of analyses and queries on the graph.

We found that the LLVM-IR retrieved from binary lifters is significantly easier to handle by the graph. This is due to the fact that most lifters tend to use rather conservative steps for their translation. This results in the LLVM-IR being closer to assembly code with comparably simple types of instructions. The LLVM-IR retrieved during the compilation, in contrast, features numerous highly specialized instructions which typically make the translation more difficult. Furthermore, the graphs retrieved from lifted binaries are typically smaller than the ones which can be retrieved when the LLVM-IR is retrieved during the compilation. This makes it an interesting application since it simplifies and speeds up the analysis. Last, we found that the graph of the decompiled binary is only marginally smaller than the one holding the lifted LLVM-IR instructions. This small advantage will, however, not outweigh the error-prone and time-consuming decompliation step in most scenarios which is required to retrieve the code.

Validity of the Results. The main threat to the validity of the findings is the set of test samples. In particular, as we could see in Sect. 6.1, the compiler has a significant impact on the generated LLVM-IR and the resulting complexity which needs to be handled by our toolchain. Hence, testing the toolchain against different compilers and configurations might lead to different results. To address this potential issue, we used XCode on macOS and clang on Ubuntu, and we also generated the LLVM-IR with Rust’s crates build system. Furthermore, we used a binary lifter to showcase a possible application to such a scenario.

Limitations. As our evaluation against the Rust standard library showed, a small amount of instructions could not be parsed correctly. This is explained by the possibility of LLVM-IR to hold sub-statements for the arguments. While we do handle the concepts and operators (e.g., casts), their potential usage in a specific sub-statement needs to be added to the translation step. To identify all possible combinations, a more extensive testing is required.

Future Work and Research Directions. The resulting graph can be used as an entry point for further research to better include specifics of certain platforms. One example is the analysis of the LLVM-IR emitted by XCode for apps written in Apple’s programming languages Swift or Objective-C. Their calling conventions differ significantly from other programming languages. As an example, Objective-C makes use of a dynamic dispatching routine which requires extensive tracing of a method’s arguments to recover type information and the method name as a string [18, 64]. This information is present in the CPG but has to be combined to identify the calls. Similarly, it is necessary to model Swift’s calling conventions and memory model since it differs significantly from the one of C++ [65, 66]. However, to date, the differences are not fully explored. Future work should identify differences and integrate this knowledge into the CPG.

Furthermore, software written in C or C++ can rely on macros which are used similar to function calls in the source code and represented as such but are replaced with their specific implementation in LLVM-IR. This discrepancy needs to be addressed appropriately to better analyze such programs. In the current stage, addressing such inconsistencies between source code and the binary is left to manual efforts of the user of the cpg library. Additional efforts are necessary to reduce these manual efforts and ease the usability of the analysis toolchain.

Last, adapting the solution to the analysis of closed-source software is promising. Once such scenario is the analysis of cloud functions while auditing a configuration of the cloud deployment. In such scenarios, the source code might not be available to the auditor but an image (e.g., of a container) together with the compiled binary of the function can be accessed and analyzed. Hence, lifting those functions to LLVM-IR and finally using this representation for an analysis is promising to improve the analysis of cloud setups.

Recent research [14] showed that lifting is a stable technique for many applications. However, lifted or decompiled binaries still suffer from a lack of information which are crucial for a security analysis [8]. Hence, further research should study which gaps still exist to apply existing tools to lifted binaries.

Generalizability. Since the SSA form is also used by other IRs (e.g. Shimple [67], WALA [63], SIL [68]), some of the challenges generalize to those IRs. Hence, the concepts presented in this paper can be reused to add further code representations using the SSA form to the CPG. Furthermore, some parts of our concept could be ported to other projects which suffer from similar issues. However, the applicability and impact depend on the projects’ data models.

8 Conclusion

We showed how we extended an open source CPG implementation to handle LLVM-IR. While the majority of instructions can easily be mapped to the high-level equivalents, the \(\varphi \) instruction and the LLVM exception handling instructions impose challenges to the translation. However, we could transform the program to the CPG representation with a reasonable increase in nodes while prior work suffered from huge performance penalties. The similarity between the resulting graph and the one of the code fractions in high-level languages allows to reuse existing analyses detecting security weaknesses or bugs. Our evaluation suggests that the approach scales to larger projects. Future work is necessary to include characteristics of some programming languages (e.g. Swift), to add analyses for further use cases, such as analyzing closed-source cloud functions, binaries and side-channel vulnerabilities, and to study the gaps of binary lifting.

Notes

1.
https://clang.llvm.org/.
2.
https://github.com/apple/llvm-project.
3.
The cmpxchg instruction compares a given argument against a value stored in a memory address. If they are equal, a new value, specified in a second argument is stored in memory. This is similar to if(*addr == arg0) *addr = arg1; in C/C++.
4.
https://github.com/Fraunhofer-AISEC/cpg.
5.
https://neo4j.com/.
6.
https://github.com/bytedeco/javacpp-presets/tree/master/llvm.
7.
Typically, an incorrect API call leads to a segfault.
8.
https://llvm.org/doxygen/Reg2Mem_8cpp_source.html.
9.
For all other variables, the statement of the assignment performs the declaration.
10.
We compiled a custom version of RetDec to update the disassembler and support the endbr64 instruction which had not been supported at the time of the experiments.
11.
https://github.com/Fraunhofer-AISEC/cpg/tree/master/cpg-analysis.
12.
https://www.eclipse.org/cdt/.
13.
We will manually investigate the ProblemNodes to parse the statements in the future.

References

Yamaguchi, F., Golde, N., Arp, D., Rieck, K.: Modeling and discovering vulnerabilities with code property graphs. In: IEEE Symposium on Security and Privacy (2014)
Google Scholar
Xiaomeng, W., Tao, Z., Runpu, W., Wei, X., Changyu, H.: Cpgva: code property graph based vulnerability analysis by deep learning. In: International Conference on Advanced Infocomm Technology (2018)
Google Scholar
Keirsgieter, W.: Graft. https://github.com/wimkeir/graft
Effendi, D.B., et al.: Plume. https://github.com/plume-oss/plume
Joern - the bug hunter’s workbench. https://github.com/joernio/joern
Banse, C., Kunz, I., Schneider, A., Weiss, K.: Cloud property graph: connecting cloud security assessments with static code analysis. In: IEEE International Conference on Cloud Computing (2021)
Google Scholar
Weiss, K., Banse, C.: A language-independent analysis platform for source code. (2022)
Google Scholar
Mantovani, A., Compagna, L., Shoshitaishvili, Y., Balzarotti, D.: The convergence of source code and binary vulnerability discovery - a case study. In: ASIA Conference on Computer and Communications Security (2022)
Google Scholar
Llvm language reference manual. https://llvm.org/docs/LangRef.html
RetDec: a retargetable machine-code decompiler. https://retdec.com/
Dinaburg, A., Ruef, A.: Mcsema: static translation of x86 instructions to LLVM. In: ReCon Conference (2014)
Google Scholar
Yadavalli, S.B., Smith, A.: Raising binaries to LLVM IR with McToll (Wip paper). In: 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (2019)
Google Scholar
Galois Inc.: reopt. https://github.com/GaloisInc/reopt
Liu, Z., Yuan, Y., Wang, S., Bao, Y.: Sok: demystifying binary lifters through the lens of downstream applications. In: Symposium on Security and Privacy (2022)
Google Scholar
Balakrishnan, G., Reps, T.: Wysinwyx: what you see is not what you execute. ACM Trans. Program. Lang. Syst. 32(6), 202–213 (2010)
Google Scholar
Jancar, J., et al.: “they’re not that hard to mitigate”: what cryptographic library developers think about timing attacks. In: IEEE Symposium on Security and Privacy (2022)
Google Scholar
Click, C., Paleczny, M.: A simple graph-based intermediate representation. In: ACM SIGPLAN Workshop on Intermediate Representations (1995)
Google Scholar
Schütte, J., Titze, D.: LIOS: Lifting IOS apps for fun and profit. In: International Workshop on Secure Internet of Things (2019)
Google Scholar
Denisov, A., Yamaguchi, F.: LLVM meets code property graphs. https://blog.llvm.org/posts/2021-02-23-llvm-meets-code-property-graphs/
Keirsgieter, W., Visser, W.: Graft: Static analysis of java bytecode with graph databases. In: Conference of the South African Institute of Computer Scientists and Information Technologists (2020)
Google Scholar
Urma, R.G., Mycroft, A.: Source-code queries with graph databases-with application to programming language usage and evolution. Sci. Comput. Program 97, 127–134 (2015)
Google Scholar
Yamaguchi, F., Lottmann, M., Rieck, K.: Generalized vulnerability extrapolation using abstract syntax trees. In: Annual Computer Security Applications Conference (2012)
Google Scholar
Yamaguchi, F., Maier, A., Gascon, H., Rieck, K.: Automatic inference of search patterns for taint-style vulnerabilities. In: IEEE Symposium on Security and Privacy (2015)
Google Scholar
Gascon, H., Yamaguchi, F., Arp, D., Rieck, K.: Structural detection of android malware using embedded call graphs. In: ACM Workshop on Artificial Intelligence and Security (2013)
Google Scholar
Baxter, I., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using abstract syntax trees. In: International Conference on Software Maintenance (1998)
Google Scholar
Rodriguez, M.A.: The gremlin graph traversal machine and language (invited talk). In: Symposium on Database Programming Languages (2015)
Google Scholar
Lam, M.S., et al.: Context-sensitive program analysis as database queries. In: ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (2005)
Google Scholar
Caracciolo, A., Chis, A., Spasojevic, B., Lungu, M.: Pangea: a workbench for statically analyzing multi-language software corpora. In: Working Conference on Source Code Analysis and Manipulation (2014)
Google Scholar
Flores, E., Barrón-Cedeño, A., Moreno, L., Rosso, P.: Cross-language source code re-use detection using latent semantic analysis. J. Univers. Comput. Sci. 21, 1708–1725 (2015)
Google Scholar
Flores, E., Barrón-Cedeño, A., Rosso, P., Moreno, L.: Towards the detection of cross-language source code reuse. In: International Conference on Application of Natural Language to Information Systems (2011)
Google Scholar
Angerer, F.: Variability-aware change impact analysis of multi-language product lines. In: International Conference on Automated Software Engineering (2014)
Google Scholar
Mushtaq, Z., Rasool, G., Shehzad, B.: Multilingual source code analysis: a systematic literature review. IEEE Access 5, 11307–11336 (2017)
Article Google Scholar
Mayer, P., Schroeder, A.: Cross-language code analysis and refactoring. In: Working Conference on Source Code Analysis and Manipulation (2012)
Google Scholar
Schiewe, M., Curtis, J., Bushong, V., Cerny, T.: Advancing static code analysis with language-agnostic component identification. IEEE Access 10 (2022)
Google Scholar
Zügner, D., Kirschstein, T., Catasta, M., Leskovec, J., Günnemann, S.: Language-agnostic representation learning of source code from structure and context. In: ICLR (2021)
Google Scholar
Rakić, G., Budimac, Z., Savić, M.: Language independent framework for static code analysis. In: Balkan Conference in Informatics (2013)
Google Scholar
Strein, D., Kratz, H., Lowe, W.: Cross-language program analysis and refactoring. In: International Workshop on Source Code Analysis and Manipulation (2006)
Google Scholar
Teixeira, G., Bispo, J.a., Correia, F.F.: Multi-language static code analysis on the lara framework. In: ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (2021)
Google Scholar
Lattner, C., Adve, V.: Llvm: A compilation framework for lifelong program analysis & transformation. In: Symposium on Code Generation and Optimization (2004)
Google Scholar
Gimple. https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html
The hir. https://rustc-dev-guide.rust-lang.org/hir.html
ECMA, S.: Ecma-335 common language infrastructure (CLI), December 2002
Google Scholar
Nethercote, N., Seward, J.: Valgrind: a framework for heavyweight dynamic binary instrumentation. In: ACM SIGPLAN Conference on Programming Language Design and Implementation (2007)
Google Scholar
Vallee-Rai, R., Hendren, L.J.: Simple: simplifying java bytecode for analyses and transformations (1998)
Google Scholar
Brumley, D., Jager, I., Avgerinos, T., Schwartz, E.J.: Bap: a binary analysis platform. In: Computer Aided Verification (2011)
Google Scholar
Dullien, T., Porst, S.: Reil: A platform-independent intermediate representation of disassembled code for static code analysis. In: CanSecWest (2009)
Google Scholar
ESIL. https://book.rada.re/disassembling/esil.html
Bardin, S., Herrmann, P., Leroux, J., Ly, O., Tabary, R., Vincent, A.: The BINCOA framework for binary code analysis. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 165–170. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22110-1_13
Chapter Google Scholar
David, R., et al.: BINSEC/SE: a dynamic symbolic execution toolkit for binary-level analysis. In: Conference on Software Analysis, Evolution, and Reengineering (2016)
Google Scholar
Klint, P., van der Storm, T., Vinju, J.: Rascal: a domain specific language for source code analysis and manipulation. In: International Working Conference on Source Code Analysis and Manipulation (2009)
Google Scholar
Facebook: Infer static analyzer. https://fbinfer.com/
Sonarqube – code quality and code security. https://www.sonarqube.org/
Checkmarx. https://checkmarx.com/product/cxsast-source-code-scanning/
Solar appscreener. https://solarappscreener.com/
Codacy. https://www.codacy.com/
Codeql. https://codeql.github.com/
Avgustinov, P., De Moor, O., Jones, M.P., Schäfer, M.: QL: Object-oriented queries on relational data. In: European Conference on Object-Oriented Programming (2016)
Google Scholar
De Moor, O., et al.: Keynote address: ql for source code analysis. In: Working Conference on Source Code Analysis and Manipulation (2007)
Google Scholar
Codechecker. https://codechecker.readthedocs.io/en/latest/
Coverity scan static analysis. https://scan.coverity.com/
Deepsource – fast and reliable static analysis platform. https://deepsource.io/
Lgtm.com. https://lgtm.com/
Wala. https://github.com/wala/WALA
Egele, M., Kruegel, C., Kirda, E., Vigna, G.: Pios detecting privacy leaks in IoS applications. In: Network and Distributed System Security Symposium (2011)
Google Scholar
Tiganov, D., Cho, J., Ali, K., Dolby, J.: Swan: a static analysis framework for swift. In: ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2020)
Google Scholar
Kraus, M., Haupert, V.: The swift language from a reverse engineering perspective. In: Reversing and Offensive-Oriented Trends Symposium (2018)
Google Scholar
Umanee, N.: Shimple: an investigation of static single assignment form. Master’s thesis, McGill University (2006)
Google Scholar
Groff, J., Lattner, C.: Swift intermediate language - a high level IR to complement LLVM (2015). https://llvm.org/devmtg/2015-10/#talk7

Download references

Acknowledgement

This work was partially funded by the Horizon 2020 project MEDINA, grant agreement ID 952633. We thank the reviewers and our shepherd Jianchang Lai for their suggestions improving this paper.

Author information

Authors and Affiliations

Fraunhofer AISEC, Garching, Germany
Alexander Küchler & Christian Banse

Authors

Alexander Küchler
View author publications
You can also search for this author in PubMed Google Scholar
Christian Banse
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Alexander Küchler .

Editor information

Editors and Affiliations

University of Wollongong, Wollongong, NSW, Australia
Willy Susilo
Xidian University, Xian, China
Xiaofeng Chen
University of Wollongong, Wollongong, NSW, Australia
Fuchun Guo
University of Wollongong, Wollongong, NSW, Australia
Yudi Zhang
Petra Christian University, Surabaya, Indonesia
Rolly Intan

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Küchler, A., Banse, C. (2022). Representing LLVM-IR in a Code Property Graph. In: Susilo, W., Chen, X., Guo, F., Zhang, Y., Intan, R. (eds) Information Security. ISC 2022. Lecture Notes in Computer Science, vol 13640. Springer, Cham. https://doi.org/10.1007/978-3-031-22390-7_21

Download citation

DOI: https://doi.org/10.1007/978-3-031-22390-7_21
Published: 09 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22389-1
Online ISBN: 978-3-031-22390-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Representing LLVM-IR in a Code Property Graph

Abstract