1 Introduction

Despite huge efforts of researchers and industry put into identifying vulnerable software, many software systems still suffer from various security weaknesses. The concept of code property graphs (CPG) [1] has been introduced to simplify identifying vulnerabilities and bugs in the source code of programs. A CPG covers properties of abstract syntax trees (AST), control flow graphs and data flow graphs, among others, thus containing all information relevant for a security analysis. The CPG enables its user to identify vulnerabilities or bugs by performing reusable graph queries. This perk led to a widespread adaption and several implementations [1,2,3,4,5,6,7]. Even if the graphs mimic the source code with a minimal loss of information, the graph provides an abstraction of the actual code. This abstraction is suitable to support a language-agnostic analysis of software. Unfortunately, the implementations are limited with respect to the supported programming languages since each language requires a separate translation.

As compilers suffer from a similar problem, the use of intermediate representations (IR) has become popular. The IR abstracts from the programming language but, in many cases, still contains a significant amount of high-level information such as the types of variables which is lost in the compiled binary and can barely be recovered [8]. The lack of such information can worsen the analysis results. The Low Level Virtual Machine (LLVM) project proposes LLVM-IR [9], a very popular IR. Numerous compiler frontends exist to translate source code to LLVM-IR. E.g., clangFootnote 1 translates the languages C, C++ and Objective-C to LLVM-IR and has been extended by Apple to support SwiftFootnote 2. Other frontends exist to support a wide range of programming languages (e.g., Rust).

While LLVM-IR was designed for compilers, it is also frequently used as output format of binary lifters [10,11,12,13]. This way, a lifter supporting multiple architectures and types of binary files can avoid the need to implement the translation for different flavors of assembly code or application binary interfaces. As binary lifting has meanwhile become a stable technique [14], its output can be used to perform a security analysis of the binary without requiring the sources.

If source code is available, analyzing the differences between the code and the binary can identify the security implications of compiler optimizations [15]. E.g., side channel vulnerabilities introduced by compilers are still a major concern of developers of cryptographic libraries [16]. As LLVM-IR can be emitted at all stages of the compilation, it can already contain such vulnerabilities.

In this paper, we show how we overcome shortcomings of existing CPG tools by enabling the analysis of LLVM-IR in such a graph. This bridges the gap between the analysis of source code written in higher-level programming languages and the analysis of programs (or dependencies) that may exist only in binary form. When including support for LLVM-IR in a CPG, several challenges arise from the static single assignment (SSA) form, the exception handling routine, instructions missing in high-level programming languages and significantly different syntactic representations of some concepts in LLVM-IR and other languages.

Contrary to prior work, we do not require to run any LLVM passes beforehand, helping us to keep the graph smaller. At the same time, we aim to retrieve as much high-level information as possible and map the code to high-level concepts whenever possible. Rather than handling LLVM-IR-specific instructions, e.g., cmpxchgFootnote 3, only as a generic function call, we translate the concepts into existing CPG node types that represent the behavior of a higher-level programming language. This allows us to re-use existing concepts in queries, such as if-statements or pointer referencing. Our case-study shows that our integration of LLVM-IR into a CPG enables us to identify misuses of cryptographic APIs, a common security weakness. It further shows that we support not only the analysis of programs whose source code is available, but also of lifted binaries, enabling us to identify weaknesses and vulnerabilities regardless of the source code’s availability. In summary, our contributions are as follows:

  • We present the first mapping of all LLVM-IR instructions to existing CPG nodes with full compatibility to the existing structure. This ensures that existing analyses are fully compatible with the representation.

  • We show how we can keep the size of the CPG minimal.

  • We are the first to include LLVM-IR’s exception handling routines in a CPG.

  • We extended the open source project cpgFootnote 4 to support our concepts.

2 Background

2.1 The Code Property Graph

The cpg project [7] enables a graph-based representation of source code of different programming languages. To date, the focus lies in Java and C/C++ but experimental support for Python, Go and TypeScript is also available. The goal of the project is to provide a language-agnostic representation of the source code. This enables a security expert to identify vulnerabilities or bugs. Furthermore, the cpg library comprises a way to store the graph in neo4jFootnote 5, and makes the graph accessible via a command line interface. For some cases, the library can also evaluate the value which can be held by a node. All this allows a security expert to write custom queries either to the graph database or the in-memory representation of the CPG. The cpg library is designed in a way to allow reusing these queries among all supported programming languages. To fulfill this goal, the cpg library implements a thorough class hierarchy which accounts for various types of statements and expressions. The CPG encodes information such as the class hierarchy of the code under analysis, the control flow graph, and the call graph in a single graph. The current design mainly targets object-oriented programming languages. To deal with a possible lack of some code fragments or errors in the code, the library is resilient to incomplete, non-compilable and to a certain extent even incorrect code.

2.2 The LLVM Intermediate Representation

The Instructions.  The LLVM-IR is used as IR of the LLVM project. Its main purpose lies in providing an abstraction of code to ease the optimization and analysis of the program in a language- and architecture-independent way. The LLVM-IR holds values in global variables (prefixed with @) and local variables (prefixed with %) both of which can be named or unnamed. The LLVM-IR follows the static single assignment (SSA) form. Hence, every variable can be written to exactly once. This limitation does not affect global variables as they are represented as memory locations and are accessed via store or load operations.

Overall, the LLVM-IR differentiates between 65 instructions. Of these, 13 are arithmetic operations, 6 are bitwise operations, and 13 instructions cast types. The remaining instructions call functions, handle exceptions, load from or store to memory, manipulate aggregated types or jump to other program locations. The instructions can be enhanced with metadata to note the calling convention, optimizations or desired properties of functions and parameters, among others.

Besides the basic instructions, LLVM-IR contains numerous so-called “intrinsics”. Those are functions which model certain standard library functionality, or model frequent actions which have to be represented differently on different architectures. Some intrinsics repeat or refine basic instructions, others insert functionality such as the automated memory management in Objective-C.

The LLVM-IR supports a simple type system and differentiates between a set of primitive types and aggregated types such as structs, arrays and vectors. Additionally, LLVM-IR has a type for labels (i.e., jump targets), metadata and a so-called token which is used by certain instructions to transport information. Overall, the type system resembles C rather than object-oriented programming languages. In fact, object-oriented concepts are handled by the respective language frontend in LLVM. The frontend translates the object-oriented properties to concepts such as VTables for overriding methods, and method name mangling to support overloaded functions. In the case of Objective-C, it uses the dynamic dispatching strategy. Other languages make use of similar concepts.

Accessing LLVM-IR.  The LLVM project offers a C++ and a C API to parse LLVM-IR and LLVm bitcode files. As the CPG project is mainly implemented in Java, access to the API has to be provided via the Java Native Interface (JNI). We use the open source project javacpp-presetsFootnote 6 which provides access to the C API via JNI. Unfortunately, the C API has a flat type hierarchy in its functions to access the LLVM-IR’s AST, thus making the parsing of instructions and the extraction of their elements more error-prone if not parsed correctlyFootnote 7. However, as our evaluation in Sect. 6 shows, our implementation works in a stable way.

3 Related Work

Code Property Graphs.  Researchers and industry proposed multiple use cases and implementations of CPGs and analysis tools [1,2,3,4,5,6,7, 17, 18]. All of these tools differ in their support for programming languages.

Closest to our work is the tool llvm2cpg [19] which uses Joern [5] as graph representation. The respective CPG represents most instructions as function calls and does not try to infer any of the high-level information. Furthermore, it uses the reg2mem LLVM pass to address the \(\varphi \) instruction of LLVM-IR, which significantly increases the code base. This results in additional instructions present in the graph and thus slows down the analysis and makes it more error-prone.

liOS [18] constructs a CPG holding assembly instructions and the function bodies lifted to LLVM-IR to analyze iOS apps. The graph model cannot be used to represent source code. Furthermore, liOS does not specifically address LLVM-IR instructions since the analyses mainly operate on assembly code.

Plume [4] and Graft [3, 20] only support Java bytecode, a different low-level language. Plume builds the graph incrementally to analyze data flows and has been merged into Joern in a revised version. Graft follows a similar goal. Other tools [1, 5, 7, 17] analyze source code and differ in their level of abstractions and supported languages. Some tools extend CPGs for specific use cases, e.g., analyzing cloud apps [6] or finding vulnerabilities with deep learning [2].

Graph-Based Security Analysis.  Various other works investigated in the usage of other graph-based representations of the source code to identify bugs or vulnerabilities [21,22,23] or similar code fragments [24, 25], traverse the graph [26] or improve the analysis [27]. These works aim to provide a rich basis for analyzing the graphs. Many of the proposed techniques operate on other graph structures (e.g. the AST). However, the CPG combines a multitude of information and includes the respective relations, thus making the required information available for the analysis. Hence, these approaches can still be applied to the CPG.

Static Analysis of Multiple Programming Languages.  Other works target the analysis of multiple programming languages [28,29,30,31,32,33]. Some of the frameworks rely on language-agnostic ASTs [34, 35] or aim to provide a common pattern for the AST of multiple languages [36, 37]. However, ASTs are cannot be used to find all kinds of bugs as they do not contain the required information [1]. Teixeira et al. [38] even propose to translate source code to a custom language.

Furthermore, various intermediate representations (IRs) have been proposed either for compilers (e.g., LLVM [39], GIMPLE [40], HIR [41], or CIL [42]), or specifically targeting code analysis (e.g. VEX IR [43], jimple [44], BIL [45], REIL [46], ESIL [47], DBA [48, 49] or RASCAL [50]). Since the IRs are often tailored to a specific use case or language, they differ in the information available in the instructions and their abstractions. Many of the IRs are integrated in analysis toolchains whose analyses are often specific to a use case and cannot easily be ported to other tools. Therefore, integrating such IRs in an abstract analysis platform like the CPG can enable further generalized security analysis.

Numerous tools [51,52,53,54,55,56,57,58,59,60,61,62,63] can analyze multiple programming languages. However, they can often barely share the analyses between the languages. The CPG representation allows reusing analyses across languages.

4 Enabling the Analysis of LLVM-IR with Code Property Graphs

Existing graph-based approaches to security analysis mostly focus on the analysis of programs with available source code. In this section, we present our approach which also enables static security analysis of lifted binaries using CPGs. E.g., this allows us to detect the misuse of cryptographic functions, as shown in our evaluation (Sect. 6.1). In detail, we describe how we map different LLVM-IR instructions to CPG nodes while reusing only existing node types. We show how to represent LLVM-specific constructs similar to their equivalents in already supported programming languages. We show how to model 1) arithmetic and logical instructions, 2) access to aggregate types, 3) the \(\varphi \) instruction and 4) LLVM-IR’s exception handling routine with a minimal increase of nodes.

4.1 Basic Instructions

Many instructions are known from other programming languages. We can coarsely differentiate between arithmetic and logical operations, operations which enforce specific interpretations of types, and operations which are composed of numerous steps but are often performed atomically on the CPU. In this section, we explain how we include those respective instructions in the CPG.

Almost all programming languages have a common subset of instructions or operations. This includes arithmetic, bitwise and logic operations, or comparisons which we map to their representation in high-level languages (+, -, *, /, %, \(\hat{ }\) , &, |, , etc.). Other instructions like jumps, calls, return instructions are modeled with their representation in C code. For if- and switch/case-statements, the branches or cases contain a simple goto statement. Later, a CPG pass removes such indirections whenever possible to reduce the size of the graph.

For some instructions, LLVM-IR can enforce a specific interpretation of the types of the arguments. E.g., the instructions udiv, sdiv and fdiv represent a division and are mapped to the binary operator /. Yet, they interpret the values as unsigned (udiv), signed (sdiv) or floating point value (fdiv). In the CPG, we add typecasts to the arguments to enforce the correct interpretation. In the context of a security analysis, such information helps to determine the type and signedness of values for detecting numerical errors (e.g., integer overflows).

In addition, some comparators of floating point values check if a number is ordered or not (i.e., if it is NAN). We split these comparisons into a check if the number is ordered and the actual comparison. E.g., the comparators ult and olt compare two floating point values and are mapped to the operator. However, the ult comparison checks if a value a is unordered or less than value b and thus is modeled as the statement . Similarly, we model the olt comparison with .

Some of LLVM’s instructions like atomicrmw and cmpxchg are known from assembly code rather than high-level languages and perform multiple operations atomically. The cmpxchg instruction loads a value from memory and replaces it with an operand if the value equals another operand. In the CPG, we model this by a block of statements holding the comparison, an if statement and the assignment in the then-branch. We annotate the block to keep the information that all this is performed atomically. Similarly, we model atomicrmw as a block of statements performing a load, an optional comparison and if-statement and an assignment to a variable. By modeling these instructions with a representation similar to source code, we simplify subsequent analyses. In contrast, prior work [19] models these instructions as a call to custom functions.

4.2 Handling Aggregate Types

High-level languages provide syntactic means to access elements of complex types like arrays, structs or objects. In LLVM-IR, arrays and structs are still available and their values can be accessed by special instructions. Since operations on such types offer a frequent attack surface (e.g., for memory corruptions), an accurate modeling of aggregated types and handling the access to their elements is relevant for a security analysis.

Fig. 1.
figure 1

The graph representing the insertvalue instruction. We can see the literal struct which is generated as well as the access to the field.

For arrays which are represented as a vector, the instructions extractelement and insertelement provide access to the elements. Both instructions are represented as an ArraySubscriptionExpression in the CPG, one being the left-hand side of the assignment and one the right-hand side. For all other aggregate types, the instructions getelementptr, extractvalue, and insertvalue model the access to the element either by the index inside an array or by the position of a field inside a structure. The code %b = insertvalue i32, i8 %a, i8 7, 1 shows how the second element of the variable a is set to 7. We model the instruction as a copy of a to the variable b and an assignment of the value 7 to the accessed field_1. Figure 1 shows the resulting graph with the initialization of b on the bottom right, and the access to the field on the left.

The example uses an interesting concept of LLVM-IR: a so-called literal structure, a struct whose layout is defined in the instruction. For such structs, we generate a type which is identified by the types of its fields. Hence, all literal structs with the same list of fields are regarded as the same type. In our example, the struct is named literal_i32_i8 and has the fields field_0 of type i32 and field_1 of type i8. The top left of Fig. 1 shows the declaration of the type. While the instructions insertvalue or extractvalue read or write values from memory, it is sometimes desirable to retrieve a pointer to an element of a structure. For this case, the instruction getelementptr computes a memory address without accessing memory. Listing 1.1 illustrates the usage of this instruction on a named struct. Listing 1.2, in turn, shows the same code written as C. Figure 2a shows the definition of the named struct and the connections between the fields for the graph retrieved from LLVM-IR. The result is remarkably similar to the graph in Fig. 2b which represents the C code. This similarity lets us reuse existing analyses for the graphs retrieved from LLVM-IR and shows that the graphs are structurally identical. In fact, the relations between variables and fields could be better resolved which can lead to improved analysis results.

figure e
figure f
Fig. 2.
figure 2

LLVM-IR using getelementptr and equivalent C code result in structurally nearly identical CPGs. The CPG contains structs (light pink) and their fields (beige), constants (green), the method argument (yellow) and return statement (purple). The elements in arrays (brown) and fields are accessed (dark pink). (Color figure online)

4.3 The \(\varphi \)-Instruction

The SSA form enforces that each variable is assigned exactly once in LLVM-IR. However, in some cases, it is required to assign a value multiple times. A frequent example is a loop counter which is set before executing the loop and is updated on each iteration. To allow such behavior without duplicating code and without storing the values in memory, the \(\varphi \)-instruction is used. It assigns the target variable one of the inputs based on the previously executed basic block (BB).

As most programming languages do not have such an instruction, there is no fitting node to represent this in the CPG. To address this issue, prior work [19] relied on the LLVM reg2mem passFootnote 8 which translates the instruction to multiple load and store operations. However, this pass also transforms the access to other variables and thus significantly increases the size of the resulting CPG. As this reduces the scalability of subsequent analyses, we avoid this LLVM pass. We collect all \(\varphi \)-instructions during the translation. Finally, we parse the instructions to identify the predecessor BBs and add an assignment to the target variable at the end of the BB. To keep the CPG clean, we further insert a declaration of the variable at the beginning of the function containing the \(\varphi \)-instruction and all BBsFootnote 9. This, however, breaks the SSA form. The snippet in Listing 1.3 contains the \(\varphi \)-instruction while Listing 1.4 shows the function’s model in the CPG.

figure g

4.4 Exception Handling

For a security analysis, a suitable representation of the exception handling is necessary to identify weaknesses which originate from a missing or incorrect handling of errors. Since LLVM-IR offers a rich system for exception handling, this information should also be included in the CPG. The CPG represents exception handling routines with try-catch statements. To make the LLVM-IR fit into this pattern, we need to identify which instructions form a try-block and which ones form a catch-block. Concerning the try-block, we represent the invoke instruction as a try-block surrounding a function call and a goto-statement.

For the catch-blocks, however, such a straightforward model is not possible. In LLVM, the catchswitch instruction selects a matching catchpad based on the signature of the catchpad-instruction of a basic block. The catchpad contains the code of the catch-block and is ended by a catchret instruction. However, the matching and signature cannot easily be transferred to a high-level name. Therefore, we model this construct as a catch-block which catches all exceptions and contains if-statements representing the signature matching. If none of them matches, the exception is thrown again. The remaining constructs such as the cleanuppad and its cleanupret instruction are not modeled specifically.

Another way to mark a catch-block is the landingpad-instruction which, again, filters for the right object to catch. Once more, the matching is specific to the programming language and thus, modelling this is left to future work. If we cannot translate the instructions to concepts supported by the CPG, we model them as special functions similar to the LLVM intrinsics.

5 LLVM-Specific CPG Passes

The frontend parses and translates the code linearly with the exception of handling the \(\varphi \) statement. This, however, leads to a state where some of the nodes generated by the frontend serve as intermediate steps and can therefore be optimized after having processed the whole code base. In particular, as we will show, we inline unconditional jump targets and improve the generated catch-blocks. This clean-up phase takes place in a pass over the CPG nodes and serves to increase the stability and scalability of subsequent security analyses.

First, none of the conditional jumps (if-statements) and switch/case-statements incorporates a meaningful body of statements. Instead, they are modeled as goto statements to other basic blocks (BB). E.g., after the translation, an if-statement looks as follows: if(cond){goto BB1;}else{goto BB2;}. The pass identifies all BBs which have only a single predecessor and replaces the respective goto-statement with the BB. Note that we do not perform this transformation if multiple predecessors exist because it would increase the number of nodes.

Second, the pass removes the instructions which serve as intermediate steps during the generation of catch-blocks. Recall that a catch-block is initially modeled by catching all types of exceptions or objects, since, e.g. when translating the invoke instruction, we do not yet know which exceptions are handled inside the block. However, in the case of a landingpad instruction, we have a single object which is caught and the respective instruction is at the beginning of the generic catch-block. Therefore, we can refine the object which is caught and remove the intermediate step. For other exception handling techniques, in turn, we have empty throw statements which should propagate the exception caught by the generic catch block if none of the switch/case-statements match the actual object. Since we can now flatten the graph by removing unnecessary indirections (e.g., unconditional jumps as above, or simply wrappers around statements which had been introduced by the line-by-line translation), we can easily propagate the object caught by the generic catch-block to the throw statements.

As we explicitly aim to handle lifted or decompiled code, a second pass can remove method stubs, i.e., methods whose only purpose is to call a library method. The main purpose of this pass is to simplify subsequent analyses.

6 Experimental Evaluation

To reuse the same analyses for the graphs constructed from source code as well as the ones containing LLVM-IR, we carefully designed the translation in a way to mimic the concepts used in source code as closely as possible. In this section, we first show a case study which advocates that we can reuse queries that aim to identify security concerns in source code to query LLVM-IR. Second, we test the implementation against the Rust standard library to show the applicability of the approach to large-scale projects. All measurements were performed on a Ubuntu 20.04 running on an Intel i5-6200U CPU and 20 GB of RAM.

Table 1. Results for detecting misuse of cryptographic libraries.

6.1 Case Study: Cryptographic Misuse

This case study is driven by the anticipated usages of the CPG on LLVM-IR. First, it should enable a security analysis of the LLVM-IR without the need to rewrite existing analyses. Second, it should be scalable by introducing a minimal number of nodes. The toolchain should be able to operate on LLVM-IR emitted during the compilation of a program (subsequently, we call this “compiled LLVM-IR”) or when lifting a binary (we call this “lifted LLVM-IR”). To show that these properties are fulfilled, we 1) compare the sizes of graphs retrieved from compilers and lifters, 2) compare the runtime of the analysis, and 3) show that the weakness can be identified with the same analysis in all samples.

We implemented a TLS-client in C++ which uses the openssl library. It accepts the insecure hashing algorithm MD5 as one of the options. First, we tested the toolchain against the original cpp file, which identified the respective issue. Next, we used XCode on macOS with the M1 chip and clang on Ubuntu running on a x64 CPU to emit the LLVM-IR which can be retrieved during compilation. As LLVM-IR also serves as target LLVM-IR for many lifters, we lifted binaries of the test file which had been compiled on the Mac and on Ubuntu with various compilers. We use RetDec [10] to lift the binaries to LLVM-IR and also decompiled them to a C-style fileFootnote 10. Table 1 summarizes the analysis time, how many nodes and functions are included in the graph and if the problem could be found successfully. We discuss the observations in the following paragraphs.

Size of the Graphs.  One of our goals is to keep the sizes of the graph small. Therefore, we compare the size of the graphs retrieved from compiled and lifted LLVM-IR and when decompiling a binary file.

One observation is the significant increase in functions contained in the LLVM-IR compared to the original C file. This can be explained by stubs introduced by the compiler. Note, however, that RetDec seems to remove some of the functions which have been introduced during compilation. This reduction facilitates and speeds up a subsequent security analysis on the resulting graph.

Not only does RetDec reduce the number of functions contained in the binary but it also reduces the number of nodes compared to compiled LLVM-IR. This observation is in-line with recent research which found that some lifters, including RetDec, can reduce the complexity of the code represented by LLVM-IR as well as the number of elements it contains [14] while keeping the main functionality of the code available. The authors further observed that RetDec’s output is not suitable for recompiling in most cases. However, as the CPG library aims to handle incomplete, non-compilable and to a certain extent even incorrect code, this limitation should not affect the representation and further analysis.

Compared to the lifted LLVM-IR, the decompiled C files contain more functions but less nodes. This is explained by the possibility to summarize multiple LLVM-IR instructions in a single C statement. Overall, the reduction of nodes can be explained by RetDec’s passes which aim to eliminate unnecessary code.

Runtime of the Analysis.  We ran the translation to the CPG and the bug detection query 100 times for each of the files and report the average runtimes in Table 1. First, it is interesting to note that the analysis time of the decompiled files is comparable to the one of the original cpp-file. The reduced number of nodes explains the speedup in some cases. The overall analysis time for the LLVM-IR files is ranging between 0.74 to 11.1 times the time of the original file. It is notable that the graphs retrieved from the LLVM-IR files contain 2.8 to 16.1 times the amount of nodes of the original file and still the runtime improved.

Identification of Weaknesses.  To detect the misconfiguration in the test file, we implemented a query to identify the arguments of calls to the function SSL_CTX_set_cipher_list. To implement this analysis, we use the constant propagation implemented in the analysis module included in the CPG libraryFootnote 11.

With the query, we are able to identify the flaw in the original C file and in the compiled and lifted LLVM-IR files. However, when decompiling the binary compiled on macOS using the M1 chip, we failed to identify the misuse. We manually investigated the case and found that the CDT libraryFootnote 12 which the CPG library uses for parsing the C file fails to identify the name of a field. Therefore, the data flow between the field and the method call is not resolved.

Stability of the Translation.  All samples could be represented in the CPG without crashes. However, the LLVM-IR retrieved during compilation of a program contains a much richer semantics and uses various different instructions. This results in warnings, some of which show that nested instructions are not yet handled. The other ones indicate that a different scoping for variables in a try-catch block is expected because LLVM-IR’s scoping differs to other languages.

Table 2. Performance when analyzing Rust libraries.

6.2 Application to the Rust Runtime

To assess the applicability to real-world programs, we retrieved the LLVM-IR from the standard and core libraries of Rust. We chose Rust since it is not yet supported by the CPG implementation and provides the option to compile to LLVM-IR. Overall, the test set includes 21 distinct LLVM files which are listed in Table 2 together with their size and the results. We report the time it took to translate the file (including various CPG passes) as well as the number of nodes which could not be parsed accurately. For the latter, we need to extend the LLVM-specific translation to include more cases of “nested” LLVM expressions.

Fig. 3.
figure 3

Relation between lines of code, nodes in the CPG and the fraction of ProblemNodes. For non-trivial samples, the error-rates are randomly distributed.

Stability.  We want to assess the maturity level of the translation step against a large and unknown codebase consisting of a total of 428, 777 lines of LLVM-IR. To measure this, the graph includes specific nodes, called ProblemNode, for each expression which could not be parsed correctly. While we handle all types of instructions, some arguments of the instructions can be computed in line by type casts, or simple arithmetic operations, among others. Overall, we could observe 7, 836 of such ProblemNodes, which accounts for \(0.60\%\) of all 1, 299, 234 nodes. This result is encouraging and indicates that the current implementation is already capable of handling the vast majority of all combinations of statementsFootnote 13.

The fraction of nodes which cannot be handled differs significantly among the samples and ranges between \(0\%\) to \(1.05\%\). Larger files are more likely to lead to an error during the translation. In addition, it is possible that the varying amount of complexity of the code could trigger more errors. To validate this, we set the average number of CPG nodes per line of code as complexity of the LLVM instructions. Among the samples, this ratio ranges between 1.22 and 3.67.

We plot this relation in Fig. 3. Neither of the graphs gives a strong indication for this idea since the error rates seem to be randomly distributed for all non-trivial samples. Neither the size nor the complexity of the samples lead to a conceptual limitation. Instead, some samples use unsoppurted expressions more frequently which can easily be addressed in the implementation.

Fig. 4.
figure 4

Analysis time vs. # Nodes. Note the logarithmic y scale.

Scalability.  Another goal is to assess the scalability of the implementation on real-world software with many lines of code. Two factors can impact the analysis time: The lines of code and the number of nodes in the graph. According to Table 2, an increase of LoC leads to more nodes in the graph in most cases. Figure 4 plots the time of the analysis (i.e., the translation to the CPG and all CPG passes but the ControlFlowSensitiveDFGPass) for the number of nodes. With the exception of one sample, the analysis time seems to grow linearly depending on the number of nodes in the graph. Interestingly, when we only consider the analysis time of the LLVM-specific translation and pass of the CPG, the outlier is no longer present. This shows that the LLVM-related translation and CPG pass do scale well even for larger samples but that some of the other CPG passes seem to perform poorly in the presence of a specific combination of nodes.

Fig. 5.
figure 5

Comparison of the analysis time of our approach and prior work for each file of Table 2. Our analysis typically finishes much faster.

Fig. 6.
figure 6

Comparison of the number of nodes of our approach and prior work for each file of Table 2. Our CPG contains fewer nodes.

Comparison to Prior Work.  To compare our approach to prior work which relied on LLVM’s reg2mem pass to remove \(\varphi \) nodes, we ran our toolchain but first executed the respective pass. As Figs. 5 and 6 show, our approach leads to a significant reduction of nodes and time required to generate the graph.

7 Discussion

Our evaluation suggests that our translation and CPG model can unify source code and low-level representations such as LLVM-IR in a single graph representation. This increases the reusability of analyses and queries on the graph.

We found that the LLVM-IR retrieved from binary lifters is significantly easier to handle by the graph. This is due to the fact that most lifters tend to use rather conservative steps for their translation. This results in the LLVM-IR being closer to assembly code with comparably simple types of instructions. The LLVM-IR retrieved during the compilation, in contrast, features numerous highly specialized instructions which typically make the translation more difficult. Furthermore, the graphs retrieved from lifted binaries are typically smaller than the ones which can be retrieved when the LLVM-IR is retrieved during the compilation. This makes it an interesting application since it simplifies and speeds up the analysis. Last, we found that the graph of the decompiled binary is only marginally smaller than the one holding the lifted LLVM-IR instructions. This small advantage will, however, not outweigh the error-prone and time-consuming decompliation step in most scenarios which is required to retrieve the code.

Validity of the Results.  The main threat to the validity of the findings is the set of test samples. In particular, as we could see in Sect. 6.1, the compiler has a significant impact on the generated LLVM-IR and the resulting complexity which needs to be handled by our toolchain. Hence, testing the toolchain against different compilers and configurations might lead to different results. To address this potential issue, we used XCode on macOS and clang on Ubuntu, and we also generated the LLVM-IR with Rust’s crates build system. Furthermore, we used a binary lifter to showcase a possible application to such a scenario.

Limitations.  As our evaluation against the Rust standard library showed, a small amount of instructions could not be parsed correctly. This is explained by the possibility of LLVM-IR to hold sub-statements for the arguments. While we do handle the concepts and operators (e.g., casts), their potential usage in a specific sub-statement needs to be added to the translation step. To identify all possible combinations, a more extensive testing is required.

Future Work and Research Directions.  The resulting graph can be used as an entry point for further research to better include specifics of certain platforms. One example is the analysis of the LLVM-IR emitted by XCode for apps written in Apple’s programming languages Swift or Objective-C. Their calling conventions differ significantly from other programming languages. As an example, Objective-C makes use of a dynamic dispatching routine which requires extensive tracing of a method’s arguments to recover type information and the method name as a string [18, 64]. This information is present in the CPG but has to be combined to identify the calls. Similarly, it is necessary to model Swift’s calling conventions and memory model since it differs significantly from the one of C++ [65, 66]. However, to date, the differences are not fully explored. Future work should identify differences and integrate this knowledge into the CPG.

Furthermore, software written in C or C++ can rely on macros which are used similar to function calls in the source code and represented as such but are replaced with their specific implementation in LLVM-IR. This discrepancy needs to be addressed appropriately to better analyze such programs. In the current stage, addressing such inconsistencies between source code and the binary is left to manual efforts of the user of the cpg library. Additional efforts are necessary to reduce these manual efforts and ease the usability of the analysis toolchain.

Last, adapting the solution to the analysis of closed-source software is promising. Once such scenario is the analysis of cloud functions while auditing a configuration of the cloud deployment. In such scenarios, the source code might not be available to the auditor but an image (e.g., of a container) together with the compiled binary of the function can be accessed and analyzed. Hence, lifting those functions to LLVM-IR and finally using this representation for an analysis is promising to improve the analysis of cloud setups.

Recent research [14] showed that lifting is a stable technique for many applications. However, lifted or decompiled binaries still suffer from a lack of information which are crucial for a security analysis [8]. Hence, further research should study which gaps still exist to apply existing tools to lifted binaries.

Generalizability.  Since the SSA form is also used by other IRs (e.g. Shimple [67], WALA [63], SIL [68]), some of the challenges generalize to those IRs. Hence, the concepts presented in this paper can be reused to add further code representations using the SSA form to the CPG. Furthermore, some parts of our concept could be ported to other projects which suffer from similar issues. However, the applicability and impact depend on the projects’ data models.

8 Conclusion

We showed how we extended an open source CPG implementation to handle LLVM-IR. While the majority of instructions can easily be mapped to the high-level equivalents, the \(\varphi \) instruction and the LLVM exception handling instructions impose challenges to the translation. However, we could transform the program to the CPG representation with a reasonable increase in nodes while prior work suffered from huge performance penalties. The similarity between the resulting graph and the one of the code fractions in high-level languages allows to reuse existing analyses detecting security weaknesses or bugs. Our evaluation suggests that the approach scales to larger projects. Future work is necessary to include characteristics of some programming languages (e.g. Swift), to add analyses for further use cases, such as analyzing closed-source cloud functions, binaries and side-channel vulnerabilities, and to study the gaps of binary lifting.