
1 Introduction

Malicious reverse engineering poses a serious risk to software confidentiality, especially for software that runs on a malicious host controlled by a software pirate intent on stealing intellectual artifacts. Thwarting such reverse engineering, using software protection techniques such as code obfuscation, is vital to defending programs against malicious host attacks. An obfuscating transformation attempts to transform code so that it becomes unintelligible to humans and automated program analysis tools, while preserving its functionality.

One of the major challenges of code obfuscation is the lack of a rigorous theoretical background. The absence of a theoretical basis makes it difficult to formally analyse and certify the effectiveness of these techniques against malicious host attacks. In particular, it is hard to compare different obfuscating transformations with respect to their resilience to attacks. Often, obfuscation transformation techniques are proposed with no provable properties presented.

Software developers strive to produce structured and easy-to-comprehend code; their motivation is to simplify maintenance. Code obfuscation, on the other hand, transforms code into a less structured and less intelligible version. It produces more complex code that looks patternless, with little regularity, and is difficult to understand. We argue that irregularities and noise make the obfuscated code difficult to comprehend. Kolmogorov complexity [11] is a well-known theory that can measure regularities and randomness: the shortest description of a program tends to grow as more noise and irregularities are present in it. Kolmogorov complexity is the basic concept of algorithmic information theory, which in many respects adapts the viewpoint of well-established Information Theory to focus on individual instances rather than random distributions. In general, algorithmic information theory replaces the notion of probability by that of intrinsic randomness of a string. Kolmogorov complexity is uncomputable; however, it can be approximated in practice by lossless compression [9, 11], which helps to intuitively understand this notion and makes the theory relevant for real-world applications. Our aim in this paper is to provide a theoretical framework for code obfuscation in the context of algorithmic information theory: to quantitatively capture the security of code obfuscation, to discuss its achievability, and to investigate its limitations and resilience against an adversary.
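Since Kolmogorov complexity is uncomputable, the usual practical proxy is the length of a losslessly compressed encoding of the object, which upper-bounds its shortest description. The following minimal sketch uses Python's zlib; the sample strings are illustrative, not taken from the paper:

```python
import zlib

def approx_K(data: bytes) -> int:
    # Compressed length: a computable upper bound on K(data),
    # up to the constant-size decompressor.
    return len(zlib.compress(data, 9))

regular = b"int x = 0;\n" * 50        # repetitive, highly regular text
irregular = bytes(range(256)) * 2     # little exploitable repetition

# The regular string admits a far shorter description.
assert approx_K(regular) < approx_K(irregular)
```

Any real compressor only yields an upper bound on K, so such estimates are meaningful for comparisons between objects rather than as absolute values.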

We introduce the notion of unintelligibility to define confusion in code obfuscation and argue that it is not sufficient. We then propose our notion of security and compare both definitions. We also present a model for a code obfuscation adversary, specifying the adversary’s capabilities, goals and winning conditions.

We argue that our model of security is fundamentally different from the virtual black-box model of Barak et al., and that because of this their impossibility result does not apply. Then, we show that under reasonable conditions we can have secure obfuscation. Finally, we study the security of two main approaches to obfuscated code in software, encoding and hiding, at the subprogram level.

To the best of our knowledge, this paper is the first to propose algorithmic information theory as a theoretical basis for code obfuscation and its threat model; we believe our work is the first step to derive consistent metrics that measure the protection quality of code obfuscation. The framework can be applied to most code obfuscation techniques and is not limited to any obfuscation method or language paradigm.

Paper Structure. In Sect. 2, we provide an overview of related work. Section 3 provides the preliminaries and the required notation. In Sect. 4, we discuss the intuition behind our approach, propose the formal definition of code obfuscation and the adversary model, and present some results on secure code obfuscation against passive attackers. In Sect. 5, we study the security of two main approaches to code obfuscation at the subprogram level. Section 6 concludes with the proposed future work.

2 Related Work

Collberg et al. [3] were the first to define obfuscation in terms of a semantics-preserving transformation. Barak et al. [2] introduced a formal definition of perfect obfuscation, based on the black-box paradigm, in an attempt to achieve well-defined security. Intuitively, a program obfuscator \({\mathcal O}\) is called perfect if it transforms any program P into a virtual black box \({\mathcal O(P)}\) so that anything that can be efficiently computed from \({\mathcal O(P)}\) can also be efficiently computed given just oracle access to P. However, they also proved that the black-box definition cannot be met, by showing the existence of a family of functions that are impossible to obfuscate. On the other hand, a recent study by Garg et al. [5] provided positive results using indistinguishability obfuscation, for which there are no known impossibility results. However, as argued in [7], indistinguishability obfuscation has a disadvantage: it does not give an intuitive guarantee about the security of code obfuscation.

3 Preliminaries

We use U as shorthand for a universal Turing machine, \( x \) for a finite-length binary string and \( |x| \) for its length. We use \({\epsilon }\) for a negligible value, and \({\varLambda }\) for an empty string. We use the notation O(1) for a constant, and p(n) for a polynomial function with input \({n \in \mathbb N}\). \( \mathbin {\Vert } \) denotes the concatenation of two programs or strings. \({\mathcal P}\) is a set of binary programs and \({\mathcal P'}\) is a set of binary obfuscated programs; \( \mathcal L=\{\lambda _n \ : \ \lambda _n \in \{0,1\}^+, n \in \mathbb N\} \) is a set of (secret) security parameters used in the obfuscation process. \( \mathbb A=\{\mathcal A_n \}_{n \in \mathbb N} \) represents a set of adversaries (deobfuscators), where an adversary \( \mathcal A\subseteq \mathbb A \) uses a set of deobfuscation techniques (e.g. reverse engineering); the term adversary is used interchangeably with deobfuscator. Given an input set \( I \) and an output set \( O \), a program functionality (meaning) is a function \([\![.]\!]\) \( :\mathcal P\times I \rightarrow O \) that computes the program’s output given an input and terminates.

We consider only the prefix version of Kolmogorov complexity (prefix algorithmic complexity) which is denoted by \( K(.) \). Complexity and Kolmogorov complexity terms are sometimes used interchangeably; for more details on prefix algorithmic complexity and algorithmic information theory, we refer the reader to [11]. The necessary parts of this theory are briefly presented in the following.

Definition 1

Let \( U(P) \) denote the output of \( U \) when presented with a program \( P \in \{0,1\}^{+} \).

  1.

    The Kolmogorov complexity \( K(x) \) of a binary string \( x \) with respect to \( U \) is defined as: \( K(x)=\text {min}\{|P| \ : \ U(P)=x\}. \)

  2.

    The Conditional Kolmogorov Complexity relative to \( y \) is defined as: \( K(x \mathop {\mid }y)=\text {min}\{|P| \ : \ U( P, y)=x\}. \)

Definition 2

Mutual algorithmic information of two binary programs \( x \) and \( y \) is given by: \( I_K(x; y) = K(y)-K(y \mathop {\mid }x). \)

Theorem 1

(chain rule [4]). For all x, y \( \in \mathbb {N} \)

  1.

    \( \ K(x,y) = K(x) + K(y \mathop {\mid }x) \) up to an additive term \( O(\log K(x, y)) \).

  2.

    \( K(x)-K(x \mathop {\mid }y) = K(y)- K(y \mathop {\mid }x)\ \text {i.e.} \ I_K(x;y) =I_K(y;x) \), up to an additive term \( O(\log K(x, y)) \).
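Combining the chain rule with a compressor C as a computable stand-in for K gives a practical estimate of mutual information: \( K(y \mathop {\mid }x) \approx C(x \mathbin {\Vert } y) - C(x) \), hence \( I_K(x;y) \approx C(y) - (C(x \mathbin {\Vert } y) - C(x)) \). A hedged sketch, with sample strings invented for illustration:

```python
import zlib

def C(b: bytes) -> int:
    # Compressed length as a computable upper bound on K.
    return len(zlib.compress(b, 9))

def I_approx(x: bytes, y: bytes) -> int:
    # I_K(x; y) = K(y) - K(y|x), with K(y|x) estimated via the
    # chain rule as C(x + y) - C(x).
    return C(y) - (C(x + y) - C(x))

x = b"def add(a, b): return a + b\n" * 20
y = x                   # identical programs share all their information
z = bytes(range(256))   # unrelated data shares almost none

assert I_approx(x, y) > I_approx(x, z)
```

Because compressors introduce constant overheads, these estimates are reliable only up to the logarithmic and constant slack that the theorem itself allows.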

Logarithmic factors like the ones needed in the previous theorem are pervasive in the theory of Kolmogorov complexity. As is commonly done in the literature, we mostly omit them in our results, noting in the theorem statements that they are there. This also “hides” smaller constant terms, which we therefore regularly omit in the derivations.

Definition 3

The mutual algorithmic information of two binary strings \( x \) and \( y \) conditioned on a binary string \( z \) is defined as: \( I_K(x ; y \mathop {\mid }z) = K(y \mathop {\mid }z) - K(y \mathop {\mid }x, z). \)

Theorem 2

([11]). There is a constant c such that for all \( x \) and \( y \)

$$ \begin{array}{rcl} K(x) \le |x|+2\log |x|+ c&\text { and }&K(x \mathop {\mid }y) \le K(x) + c. \end{array} $$

Theorem 3

([13]). Given a recursive computable function \( f:\{0,1\}^* \rightarrow \{0,1\}^* \), for all \( x \) the algorithmic information content of \( f(x) \) is bounded by: \( K(f(x)) \le K(x)+ O(1). \)

If we are aware that \( x \) belongs to a subset \( S \) of binary strings, then we can consider its complexity \( K(x|S) \). Also, we can measure the level of randomness (irregularities) using randomness deficiency.

Definition 4

[11] The randomness deficiency of \( x \) with respect to a finite set \( S \) containing \( x \) is defined as \( \delta (x|S) = \log \#S - K(x|S). \)
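As a worked instance (a standard illustration, not taken from this paper): take \( S=\{0,1\}^n \), so that \( \#S = 2^n \) and

$$ \delta (x|S) = n - K(x|S). $$

A string that is maximally complex given \( S \), i.e. \( K(x|S) \approx n \), has deficiency close to \( 0 \): it is a typical, random-looking element of \( S \). In contrast, \( x = 0^n \) has \( K(x|S) = O(\log n) \) and hence deficiency \( n - O(\log n) \): it is highly regular and atypical.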

Lemma 1

[11] Let S be an enumerable binary set of programs such that \( x \in S \). Then, \( K(x) \le K(S) + \log \#S + O(1) \).

4 Code Obfuscation Using Kolmogorov Complexity

The main purpose of code obfuscation is to confuse an adversary, making the task of reverse engineering extremely difficult. Code obfuscation introduces noise and dummy instructions that produce irregularities in the targeted obfuscated code. We believe that these make the obfuscated code difficult to comprehend. Classical complexity metrics have limited power for measuring and quantifying irregularities in obfuscated code, because most of these metrics are designed to measure certain aspects of code attributes, such as finding bugs and easing code maintenance. Human comprehension is key in this case: an adversary has to understand the obfuscated code in order to recover the original. Measuring code obfuscation has to take this human factor into consideration. Although measuring code comprehension is very subjective, there have been some successful attempts to measure human cognitive reasoning based on Kolmogorov complexity [6].

Code regularity (and irregularity) can be quantified, as suggested in [8, 10], using Kolmogorov complexity and compression. Code regularity means that a certain structure is repeated many times, and thus can be recognized. Conversely, irregularity in code can be explained as the code exhibiting different types of structure over its body. Regularity in programs was proposed by Jbara et al. [8] as a potential measure of code comprehension; using compression, they experimentally showed that long regular functions are less complex than conventional metrics such as LOC (Lines of Code) and McCabe’s Cyclomatic complexity would estimate.

The main intuition behind our approach is based on the following argument: if an adversary fails to capture some patterns (regularities) in an obfuscated code, then the adversary will have difficulty comprehending that code: it cannot provide a valid and brief, i.e., simple description. On the other hand, if these regularities are simple to explain, then describing them becomes easier, and consequently the code will not be difficult to understand.

Fig. 1.

Obfuscation example: (a) is the original code for the sum of \( n \) integers; (b) is an obfuscated version of (a) with one opaque predicate and data encoding, which retains some patterns and regularities; (c) is another obfuscated version of (a) with three opaque predicates and data encoding, which has fewer patterns and regularities compared to (b).

We demonstrate our motivation using the example in Fig. 1. We obfuscate the program in Fig. 1-(a), which calculates the sum of the first \( n \) positive integers, by adding opaque predicates with bogus code and data encoding. We apply Cyclomatic complexity (McCabe [12]), a classical complexity measure, to Fig. 1-(b). Cyclomatic complexity is based on the control flow graph (CFG), and is computed as \( E - N + 2 \), where \( E \) is the number of edges and \( N \) is the number of nodes in the CFG. Figure 1-(b) contains \( N=8 \) nodes and \( E=13 \) edges, so its Cyclomatic complexity is \( (13 -8 +2)=7 \). We can see some regularity here: there is one opaque predicate repeated three times. Furthermore, the variable \( y \) is repeated three times in the same place of the If-branch. We conjecture that a short description of the program in Fig. 1-(b) can be found by lossless compression, due to the presence of this regularity.
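The Cyclomatic complexity computation above is a direct application of McCabe's formula; as a trivial executable check, using the numbers from the text:

```python
def cyclomatic_complexity(edges: int, nodes: int) -> int:
    # McCabe's measure for a single connected control flow graph:
    # E - N + 2.
    return edges - nodes + 2

# Figure 1-(b): N = 8 nodes and E = 13 edges in the CFG.
assert cyclomatic_complexity(13, 8) == 7
```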

We take another obfuscated version, Fig. 1-(c), of the same program; this code is obfuscated by adding three different opaque predicates. There are fewer patterns in this version compared to Fig. 1-(b); the shortest program that describes Fig. 1-(c) is likely to be very similar to the code itself, whereas the Cyclomatic complexity is still 7 and does not account for the changes that occurred in the code. Assuming the opaque predicates of Fig. 1-(c) are equally difficult to break, attacking this code requires at least twice the effort of attacking the code in Fig. 1-(b), as we need to figure out the values of two more opaque predicates. Furthermore, Fig. 1-(b) can be compressed at a higher rate than Fig. 1-(c); again, this is due to the inherent regularity in Fig. 1-(b). We argue that an obfuscated program which is secure and confuses an adversary will exhibit a high level of irregularity in its source code and thus require a longer description to characterize all its features. This can be captured by the notion of Kolmogorov complexity, which quantifies the amount of information in an object. An obfuscated program will have more non-essential information, and thus higher complexity, than a non-obfuscated one. Thus, we can use Kolmogorov complexity to quantify the level of confusion in obfuscated programs due to the obfuscation process.
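The claim that (b) compresses at a higher rate than (c) can be sketched with a lossless compressor. The code strings below are hypothetical stand-ins for the two obfuscated versions, not the actual listings of Fig. 1:

```python
import zlib

def ratio(s: bytes) -> float:
    # Compression ratio: lower means more regularity to exploit.
    return len(zlib.compress(s, 9)) / len(s)

# (b)-style: one opaque predicate repeated three times.
pred = b"if ((x * x + x) % 2 == 0) { y = 2 * y - y; }\n"
code_b = pred * 3

# (c)-style: three distinct opaque predicates of similar length.
code_c = (b"if ((x * x + x) % 2 == 0) { y = 2 * y - y; }\n"
          b"if ((7 * y * y - 1) % 7 != 0) { z = z + 0; }\n"
          b"if ((n * n * n - n) % 3 == 0) { w = w | 0; }\n")

# The regular version compresses at a higher rate (lower ratio).
assert ratio(code_b) < ratio(code_c)
```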

4.1 Applying Kolmogorov Complexity to Code Obfuscation

In this section, we present a novel approach for code obfuscation based on notions from algorithmic information theory. We start with an intuitive definition that is inspired by practical uses of obfuscation. The rationale behind this definition is that an obfuscated program must be more difficult to understand than the original program. This uses the separate notion of \( c \)-unintelligibility:

Definition 5

A program \(P'\) is said to be \( c \)-unintelligible with respect to another program P if it is c times more complex than P, i.e. the added complexity is \( c \) times the original one, and thus more difficult to understand. Formally:

$$ K(P') \ge (c+1) K(P), $$

for some constant \( c > 0 \).
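Definition 5 can be approximated in practice by substituting compressed length for K. A hedged sketch with invented sample programs (a noisy byte string stands in for the obfuscated version):

```python
import zlib

def approx_K(p: bytes) -> int:
    # Compressed length as a computable stand-in for K(p).
    return len(zlib.compress(p, 9))

def is_c_unintelligible(p_obf: bytes, p: bytes, c: float) -> bool:
    # Definition 5: K(P') >= (c + 1) * K(P), with K replaced
    # by its compression proxy.
    return approx_K(p_obf) >= (c + 1) * approx_K(p)

original = b"s = 0\nfor i in range(n):\n    s += i\n" * 4
noisy_obf = bytes(range(256))  # stand-in for a noisy obfuscated version

assert is_c_unintelligible(noisy_obf, original, 1.0)
```

A program is never c-unintelligible with respect to itself for \( c > 0 \), which the proxy reproduces as well.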

Definition 6

A \( c \)-Obfuscator \( \mathcal O:\mathcal P\times \mathcal L\rightarrow \mathcal P' \) is a mapping from programs with security parameters \( \mathcal L \) to their obfuscated versions such that \( \forall P \in \mathcal P, \lambda \in \mathcal L\ .\ \mathcal O(P,\lambda ) \ne P \) and that satisfies the following properties:

  • Functionality: \( \mathcal O(P,\lambda ) \) and \( P \) compute the same function, i.e. \( \forall i \in I . \) \([\![P]\!]\) \( (i)= \) \([\![{\mathcal O(P,\lambda ) }]\!]\) \( (i) \).

  • Polynomial Slowdown: the size and running time of \( \mathcal O(P,\lambda ) \) are at most polynomially larger than those of \( P \), i.e. for some polynomial \( p \): \( |\mathcal O(P,\lambda )| \le p(|P|) \), and if \( P \) halts in \( k \) steps on an input \( i \), then \( \mathcal O(P,\lambda ) \) halts within \( p(k) \) steps on \( i \).

  • Unintelligibility: \( \mathcal O(P,\lambda ) \) is c-unintelligible with respect to P.

It is interesting to ask to what extent unintelligibility is related to the quality of the obfuscation parameter \( \lambda \). Is a large \( \lambda \) necessary for high unintelligibility? Is it sufficient? We answer the first question in the positive by showing that \( c \)-unintelligibility sets a lower bound on the size of \( \lambda \).

Lemma 2

Consider a program \( P \) and an obfuscated version \( P' = \mathcal O(P,\lambda ) \) such that \( P' \) is \( c \)-unintelligible with respect to \( P \). Then, \( |\lambda | \ge c K(P) - O(1). \)

Proof

By assumption, \( K(\mathcal O(P,\lambda )) \ge (c+1) K(P) \). To compute \( P' \), we only need \( P \), \( \lambda \) and the obfuscator program \( \mathcal O \) and so we can upper bound \( K(P') \):

$$\begin{aligned} K(P) + K(\lambda |P) + K(\mathcal O)&\ge K(\mathcal O(P,\lambda )) \\&\ge (c+1) K(P) \\ \Rightarrow K(\lambda |P) \ge c K(P) - K(\mathcal O). \end{aligned}$$

Assuming the obfuscator program is simple, that is, \( K(\mathcal O) = O(1) \), we have by basic properties of Kolmogorov complexity: \( |\lambda | \ge K(\lambda ) \ge K(\lambda | P) \ge c K(P) - O(1). \)

To answer the second question, we show a counter-example. So far, we have not addressed the nature of \( \mathcal O \) and how well it uses its obfuscation parameter. It could well be the case that \( \mathcal O \) only uses some bits of \( \lambda \) to modify \( P \). In an extreme case, it can ignore \( \lambda \) altogether and simply return \( \mathcal O(P,\lambda ) = P \). The result satisfies the first two properties of an obfuscator, but can be considered unintelligible only in the degenerate case for \( c = 0 \) and surely we would not call the resulting code obfuscated. Another extreme case is when \( \lambda = P \). Now, we would have at most \( K(\mathcal O(P,\lambda )) \le K(P) + K(\mathcal O) + O(1) \) which again would lead to a very small \( c \). These two cases, although extreme, serve only to show that the quality of an obfuscator depends not only on \( \lambda \) but also on the obfuscation algorithm itself, \( \mathcal O \). This is addressed later in Theorem 7.

Definition 6 is perhaps the first natural definition one can find, but it has one shortcoming. Merely requiring the obfuscated program to be complex overall does not mean that it is complex in all its parts, and in particular, that it hides the original program. To illustrate this point, consider the following example.

Example 1

Consider an obfuscated program \( P'=\mathcal O(P,\lambda )=P \mathbin {\Vert }\lambda \), which is a simple concatenation of \( P \) and \( \lambda \). Define \( n = |P'| \). We know \( K(P \mathbin {\Vert }\lambda ) \simeq K(P, \lambda ) \) within logarithmic precision (see [11] page 663). Then, by applying the chain rule of Theorem 1, \( K(P')=K(P\mathbin {\Vert }\lambda ) \simeq K(P) + K(\lambda |P) + O(\log n) \). For a large \( \lambda \) independent of \( P \), this might signify large unintelligibility, but the original program can be extracted directly from the obfuscated version, requiring only \( O(\log n) \) bits to indicate where \( P \) ends and \( \lambda \) starts.
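Example 1 can be reproduced concretely: the concatenation "obfuscator" inflates complexity yet leaks P verbatim. The program and padding below are illustrative; pseudorandom padding built by iterated hashing stands in for an incompressible \( \lambda \):

```python
import hashlib
import zlib

def C(b: bytes) -> int:
    # Compressed length as a stand-in for K.
    return len(zlib.compress(b, 9))

P = b"def total(n):\n    return sum(range(n + 1))\n" * 5
# Pseudorandom, essentially incompressible padding standing in for lambda.
lam = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(20))

P_obf = P + lam  # the concatenation "obfuscator" of Example 1

# P' looks much more complex than P ...
assert C(P_obf) > 2 * C(P)
# ... yet P is recoverable verbatim, given only where P ends.
assert P_obf[:len(P)] == P
```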

This leads us to our second definition, where we require not that the obfuscated program be more complex than the original, but rather that it reveal almost no information about the original. This is captured by the notion of algorithmic mutual information and can be stated formally as:

Definition 7

Consider a program P and its obfuscated program \({P' = \mathcal O(P, \lambda )}\). We say \({P'}\) is a \({\gamma }\)-secure obfuscation of P if the mutual information between P and \({P'}\) is at most \({\gamma }\), that is:

$$\begin{aligned} I_K(P;\mathcal O(P,\lambda )) \le \gamma . \end{aligned}$$

We say \({P'}\) is a secure obfuscation of P if it is \({\gamma }\)-secure and \({\gamma }\) is negligible.

It is common to consider, in the literature about Kolmogorov Complexity, that logarithmic terms are negligible. Thus, if both \( P \) and \( P' \) have lengths of order \( n \), we might consider that \( P' \) would be a secure obfuscation for \( \gamma = \log n \). This intuition, however, is bound to fail in practice.

Programs are typically redundant, written in very well-defined and formal languages, with common structures, design patterns, and even many helpful comments. The complexity of a non-obfuscated program is thus expected to be low compared to its length. Consider then the case that for a given \( P \) and \( n = |P| \) we have \( K(P) = O(\log n) \). Consider a scenario like that of Example 1, where the obfuscated program reveals the original one and so the mutual information between both programs is maximal, i.e., \( I_K(P;\mathcal O(P,\lambda )) = K(P) - O(\log n) = O(\log n). \) Even though this obfuscation cannot be considered secure, the resulting mutual information is so small that Definition 7 would declare it secure. We have two ways out of this:

  • we do not consider programs with \( K(P) = O(\log n) \), since this is the error margin of the important properties of Kolmogorov complexity, and at this level we cannot achieve significant results;

  • or we consider a relative definition of security, requiring that the mutual information be only at most a negligible fraction of the information in \( P \).

The second option leads us to the following definition:

Definition 8

Consider a program \( P \) and its obfuscated program \( P' = \mathcal O(P, \lambda ) \). We say \( P' \) is an \( \epsilon \)-secure obfuscation of \( P \) if the mutual information between \( P \) and \( P' \) is at most \( \epsilon K(P) \), that is: \( I_K(P;\mathcal O(P,\lambda )) \le \epsilon K(P), \) for \( 0 \le \epsilon \le 1 \).

We say \( P' \) is a secure obfuscation of \( P \) if it is \( \epsilon \)-secure and \( \epsilon \) is negligible in some appropriate sense.
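Definition 8, too, admits a compression-based approximation. In the sketch below the sample program and byte strings are invented: an "obfuscation" independent of P passes the test, while a version that embeds P verbatim fails it. The value \( \epsilon = 0.25 \) is chosen only to leave numerical margin for compressor overhead, not because it is negligible:

```python
import hashlib
import zlib

def C(b: bytes) -> int:
    # Compressed length as a stand-in for K.
    return len(zlib.compress(b, 9))

def I_approx(x: bytes, y: bytes) -> int:
    # I_K(x; y) = K(y) - K(y|x), with K(y|x) ~ C(x + y) - C(x).
    return C(y) - (C(x + y) - C(x))

def is_eps_secure(P: bytes, P_obf: bytes, eps: float) -> bool:
    # Definition 8: I_K(P; O(P, lambda)) <= eps * K(P).
    return I_approx(P, P_obf) <= eps * C(P)

P = (b"def checksum(data):\n"
     b"    total = 0\n"
     b"    for i, b in enumerate(data):\n"
     b"        total = (total + (i + 1) * b) % 65521\n"
     b"    return total\n"
     b"\n"
     b"def digest(data):\n"
     b"    return '%08x' % checksum(data)\n")

# Pseudorandom bytes sharing (almost) no information with P.
independent = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(8))
leaky = P + independent   # embeds P verbatim, as in Example 1

assert is_eps_secure(P, independent, 0.25)
assert not is_eps_secure(P, leaky, 0.25)
```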

4.2 Adversary Model

To properly define security we need to specify the capabilities of our attacker. The most basic case we want to capture is that of a human who seeks to obtain some original code from an obfuscated version thereof, without the assistance of any automated tools. The difficulty of the analyst’s task is measured by the amount of information that s/he lacks to obtain the target code. If the obfuscation is weak, this will be small. A good obfuscation will force the analyst to obtain more information to reach its target, possibly some of the randomness used to execute the obfuscation.

At the other extreme, we have an analyst with access to the complete range of analysis tools. S/he can execute the obfuscated code (the ‘challenge’), run any program on the challenge to obtain static analysis information on that code, or produce variations of the challenge and subsequently run them. The adversary is not restricted in how many times s/he runs these programs nor in what order. Ultimately, with all the information gathered from this process, the adversary will attempt to produce the original code that was obfuscated into the challenge. This model equally captures automated reverse-engineering analysis techniques, such as static program analysis (e.g., data flow, control flow, alias analysis, program slicing, disassemblers, and decompilers) and dynamic program analysis (e.g., dynamic testing, profiling, and program tracing).

While the first adversary is too limited to be realistic, the powers of the second are too broad to be useful. Every obfuscated program that is obtained by a deterministic transformation (taking clear code and auxiliary randomness as input) can be reversed by undoing the exact same steps in the opposite order. Therefore, for each instance there will be at least one computable function that is able to return the original code. To achieve a meaningful model we have to establish limits on the adversary’s power and exclude such situations, which would make any obfuscation trivially impossible.

A distinction commonly made in the literature is whether the attacker is passive or active. The first kind of adversary is limited to analysing the source code of the challenge, but can neither run nor modify it. The second kind can execute the challenge code and modified versions thereof. We model these two kinds of adversary through the choice of computable functions available to the adversary.

To define our adversary model, we take inspiration from the game-based definitions of adversary capabilities commonly used in cryptography, although there are important differences. As standard, we have a game played between two parties, a challenger and an adversary, to assert the security of a particular instance of the obfuscation process \( \mathcal O \). While \( \mathcal O \) is known to both parties, only the challenger knows \( P \); the challenger chooses a random \( \lambda \) and produces \( P' = \mathcal O(P, \lambda ) \). The pair \( (P,P') \) is the instance of the game.

It is customary to first fix an adversary and then select the challenge at random, so that the adversary cannot be tailored to a specific instance, i.e., the adversary has no information at all about the messages that are used to construct the challenge. This then leads to a probabilistic analysis of the success of the adversary.

Kolmogorov Complexity theory is often used to analyse individual instances and avoid the analysis of probabilistic ensembles, and this is the route we follow in our approach. We of course also intend to guarantee that the adversary holds no information about the plain code, but instead of requiring this to be picked at random, we simply state that the mutual information between the plain code and the code of a legitimate adversary must be very low.

Thus our model of security is focused on individual instances of an obfuscation process, instead of considering at the same time all possibilities. This makes analysis much easier in a practical setting, since we can choose a particular case of obfuscated code and analyse its security without having to consider all other cases. This also shifts the focus of the security definition in the direction of actual examples of obfuscated code, and away from the obfuscator algorithm per se.

According to this reasoning, we say an adversary is legitimate if and only if its code, \( \mathcal A \), satisfies \( I_K(P ; \mathcal A) = O(1) \).

This excludes the trivial case where a particular adversary already knows the source code or a good deal of it, and could therefore win the game without having any “intelligence” to undo the obfuscation. We mentioned before that for a particular instance, it is always possible to find a deterministic algorithm that trivially undoes the obfuscation: for example, an algorithm that simply “knows” the solution and prints it. This case is excluded by this mutual information requirement. Since \( \mathcal A \) can print \( P \) even without knowledge of \( P' \), \( K(P | \mathcal A) = O(1) \) and so \( I(P ; \mathcal A) = K(P) - O(1) \).

Another trivial algorithm is one that does not have any particular intelligence for a general obfuscated program: it just has a hard-coded list of changes that revert \( P' \) to \( P \), for example, a list of the steps in obfuscation process detailing the state before and after that step. But then, this means \( \mathcal A \) already contains in its code parts that are specific to \( P \): some algorithm can look into \( \mathcal A \) and thus reconstruct these parts of \( P \) even without knowing \( P' \). The complexity of these parts is the information contained in \( \mathcal A \) about \( P \) and if \( P \) and \( P' \) are reasonably different, the complexity of this list will be larger than \( O(1) \).

This case is subtly different from one where the obfuscation process is so weak that it is possible to write a simple reversal algorithm, say \( R \), that simply undoes each step of \( \mathcal O \). In this case, \( \mathcal A \) would use \( R \) as a list of steps to turn \( P' \) into \( P \), much like in the previous case, but with the difference that these steps can be applied to any instance \( (P,P') \) of this obfuscator: for example, instructions that convert a particular structure into another, or that add or remove specific letters from a variable name. In this case, \( \mathcal A \) knows \( R \), but \( R \) does not contain any information about a specific \( P \), which makes \( \mathcal A \) a legitimate adversary, giving evidence that \( \mathcal O \) is a bad obfuscator and that instances computed from it are weak.

On the other hand, there is also some lee-way in the adversary’s goal. The most strict goal is to produce the original code. More relaxed conditions would allow the adversary to succeed if it could produce code that was close enough to the original source and had the same functionality. In our scenario, the adversary already knows the functionality to a large degree. If s/he did not, an adequate winning condition would be simply to produce as simple an equivalent version of the obfuscated code as possible, which would mean the adversary had understood the functionality and found a more compact implementation for it. But this does not mean the adversary’s implementation is better than the original in some practical terms, for example, that it is more efficient. Our aim is to represent a situation where the obfuscated implementation of a specific functionality holds some value for the attacker: although the latter might know the full functionality, the way in which this is implemented is not known, and might be better in some practical terms than all the implementations the adversary currently knows.

So, the attacker will succeed if s/he can produce code that, although not the original, satisfies some properties chosen by the attacker in the same way the original code does. Finally, the resulting code should not be significantly more complex than the true original; that is, they must have similar Kolmogorov complexities. Even if one might not be easily derived from the other, these two conditions guarantee that the adversary’s solution and the true solution are equivalent in practice.

Definition 9

(Victory Conditions). For some program \( P \) and security parameter \( \lambda \), given a proximity parameter \( \delta \), a program \( P' = \mathcal O(P, \lambda ) \) and a set of properties \( \varPi = \{\pi _1, \ldots , \pi _n\} \), the adversary wins if it is able to produce a program \( P^* \) such that:

  • \( \pi _i(P^*) = \pi _i(P) \) for all properties \( \pi _i \) in \( \varPi \);

  • \( P^* \) is \( \delta \)-close to \( P \), that is, \( | K(P) - K(P^*) | \le \delta K(P) \) for some \( \delta \ge 0 \).
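The \( \delta \)-closeness condition is again directly checkable with the compression proxy for K; the byte strings below are illustrative:

```python
import zlib

def approx_K(p: bytes) -> int:
    # Compressed length as a stand-in for K(p).
    return len(zlib.compress(p, 9))

def delta_close(p_star: bytes, p: bytes, delta: float) -> bool:
    # Victory condition: |K(P) - K(P*)| <= delta * K(P).
    return abs(approx_K(p) - approx_K(p_star)) <= delta * approx_K(p)

P = b"s = 0\nfor i in range(n):\n    s += i\n" * 4

assert delta_close(P, P, 0.0)                      # identical: 0-close
assert not delta_close(bytes(range(256)), P, 0.5)  # far more complex
```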

The properties in \(\varPi \) represent the characteristics that the adversary values in the target program and that his/her solution should, therefore, preserve. These could be, for example, “P belongs to complexity class X”, or “P uses at most n bits of randomness”. In general they might not even be efficiently computable, so for a particular definition to be viable these properties should be chosen with care: that is, they should be represented by programs that compute those properties.

We thus summarize our adversary model:

  • the adversary has no information about the source code used to produce the challenge;

  • the adversary can run for an unlimited amount of time;

  • the adversary can execute auxiliary algorithms that it can use to analyse the challenge;

  • the adversary breaks the obfuscation if it can produce code close enough to the original, and the information it used is less, by some amount, than that of the clear code.

We give a more formal description below:

Definition 10

(Adversary’s Game Parameters)

  • \({(P,P' = \mathcal O(P,\lambda ))}\)

  • Capabilities of the adversary:

    • The type of adversary, i.e., passive or active.

  • Adversary’s goals:

    • A set of properties that the target source code satisfies: \(\varPi = \{\pi _1, \ldots , \pi _n\}\);

    • A proximity parameter \(\delta \).

  • Security parameter:

    • A measure of the adversary’s advantage against the defender.

Definition 11

(Adversary Model)

  • Phase 0:

    • The adversary A must satisfy \(I(P ; A) = O(1)\), otherwise it is not legitimate and the game is meaningless.

  • Phase 1:

    • The adversary selects a set of computable functions representing the analysis techniques s/he intends to use: \(F = \{F_1, \ldots , F_k\}\). These functions can be constants (i.e., they are simply advice strings).

      • If the adversary is passive, no function is chosen for F, and the adversary moves directly to Phase 3.

      • If the adversary is active, the adversary can choose any function that takes code and produces other code (static or dynamic analysis).

      • If the adversary is active, \(P'\) is added to the set F.

  • Phase 2:

    • The adversary executes any function in F as many times and in any order it wishes, under the sole constraint that the total execution time does not take more than \(t(|P'|)\).

  • Phase 3:

    • The adversary outputs its guess at the original code: \(P^*\).

The adversary succeeds for security parameter \(\epsilon \) if:

  • The functionality of \(P^*\) and \(P'\) is equal (Footnote 4), and \(P^*\) is close to P, that is, \(| K(P) - K(P^*) | \le \delta K(P)\).

  • \(\pi _i(P^*) = \pi _i(P)\), for all \(1 \le i \le n\).

  • Kolmogorov complexity of the adversary’s information, \( K(A, F_1, \ldots , F_k) \), is less than \((1-\epsilon ) K(P)\).

This adversary model using the security game is compatible with our mutual-information-based security definition in Definition 8: if the last condition holds, the adversary successfully produces \( P \) from \( P' \). Therefore, \( K(P|P') \le K(\mathcal A, F_1, \ldots , F_k) < (1-\epsilon )K(P) \), and so \( I_K(P;P') > \epsilon K(P) \).

Notes:

  • The amount of information used by the adversary is upper-bounded by the length of the adversary’s own code, plus that of any algorithm the adversary chooses in Phase 1. Strictly, this quantity should be measured by the Kolmogorov complexity of the above; in practice, it can be approximated by the length of a compression thereof or, in the worst case, by the original strings.

The next proposition provides an example of how to test the security of an obfuscated code against an adversary.

Proposition 1

Consider a clear program \( P \) of length \( n \) with \( K(P)\ge n - O(\log n) \), and let \( \mathcal A \) be an adversary that extracts at least \( m\le n \) consecutive bits of \( P \) from \( P'=\mathcal O(P,\lambda ) \). Then:

  1. \( K(P|P') \le n- m + O(\log n) \);

  2. \( I_K(P;P') \ge m - O(\log n) \).

Proof

We prove this proposition by building the following algorithm:

Algorithm:

  • Build \( \mathcal A \) of length \( O(1) \).

  • Compute \( \mathcal A(P') \) to obtain \( m \le n \) bits of \( P \); denote these by the string \( \omega \).

  • Now, build a program \( \beta \) such that \( P=\beta (\omega ) \), which computes two blocks of bits: those that come before and those that come after \( \omega \). To produce \( P \), \( \beta \) needs to produce at most a pair of strings (\( s_1, s_2 \)) with combined length \( n-m \). To describe the pair, we need at most \( O(\log n) \) bits saying where to divide \( s_1 \) from \( s_2 \). Thus, \( K(\beta ) \le n - m + O(\log n) \).

By construction, \( K(P | \mathcal O(P,\lambda )) \le K(\beta ) + O(1) \le n - m + O(\log n) \). Using the assumption about \( K(P) \), it is straightforward to compute

$$\begin{aligned} I_K(P;P') \ge m - O(\log n) \end{aligned}$$
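The reconstruction step in the proof can be sketched concretely: given the extracted substring ω and the remaining bits, the program β only needs the split position (O(log n) bits) to rebuild P. This is a toy illustration with our own names, not the paper's construction.

```python
def beta(omega: str, remainder: str, split: int) -> str:
    """Rebuild P from the extracted bits omega plus the n-m remaining bits.
    'split' (describable in O(log n) bits) says where remainder divides into
    the prefix s1 and suffix s2 that surround omega in P."""
    s1, s2 = remainder[:split], remainder[split:]
    return s1 + omega + s2

P = "0110100111010001"          # the clear program, as a bit string
omega = P[5:12]                  # m = 7 consecutive bits extracted by the adversary
remainder = P[:5] + P[12:]       # the n - m bits beta still has to supply
assert beta(omega, remainder, split=5) == P
```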

On the other hand, the adversary can fully obtain \( P \), with only a logarithmic error, if it knows the security parameter \( \lambda \), as the next theorem shows.

Theorem 4

Let \( P'=\mathcal O(P,\lambda ) \), for an adversary \( \mathcal A \) who knows \( \lambda \):

  1. \( K(P|P', \lambda )= O(\log n). \)

  2. \( I_K(P;P'|\lambda ) = K(P|\lambda ) - O(\log n). \)

where \( n \) is the maximum length of \( P, P' \) and \( \lambda \).

Proof

If \( \lambda \) contains all the knowledge that \( \mathcal A \) requires to obtain \( P \) given \( P' \), then the shortest program for a universal Turing machine \( U \) that describes \( P \) given \( \mathcal A, P' \) and \( \lambda \) is negligible, i.e., the empty string. It is sufficient for \( U \) to describe \( P \) from \( \mathcal A, P' \) and \( \lambda \) with no need for any extra program, i.e., \( U(\varLambda ,\mathcal A(P'),\lambda )=P \), where \( \varLambda \) is the empty string. The \( O(\log n) \) term is the overhead cost required by \( U \) to combine \( \mathcal A, P' \) and \( \lambda \) and to locate them on \( U \)'s tape. The advantage of \( \mathcal A \) given \( \lambda \) is obtained as follows:

$$\begin{aligned} I_K(P; P'|\lambda )&= K(P|\lambda ) - K(P|P',\lambda ) \\&= K(P|\lambda ) - O(\log n) \end{aligned}$$

These results are not surprising. Intuitively, an adversary can easily recover the original code from the obfuscated version once the security parameter that is used for obfuscation is known.

4.3 On the Impossibility of Obfuscation

There exist other definitions of obfuscation in the literature. Of particular importance to us is the work of [1], due to its famous impossibility result. As the authors argue in that paper, the black-box model they propose for obfuscation is too strong to be satisfiable in practice. The black-box model considers a program (Footnote 5) \( P' \) obfuscated if any property that can be learned from \( P' \) can also be obtained by a simulator with only oracle access to \( P' \). This essentially states that \( P' \) does not give any particular information about that property, since it is possible to learn it without having access to \( P' \). Notice that this model does not compare an obfuscated program with an original one, but rather with its functionality. This is different from the definitions that we have proposed so far. Our definitions can be used to capture this purpose, namely, measuring how much information a program \( P' \) gives about the function it computes, which we denote by \( [\![P' ]\!] \).

It suffices to note that every function \( F \) has a minimal program, say \( Q \). Then its Kolmogorov complexity is the size of \( Q \), and for every other program \( P' \) computing \( F \) we have \( K(F) = |Q| \le |P'|. \)

For every such program it must be that \( K(F) \le K(P') + O(1) \): otherwise we could build a program \( R \) of size \( K(P') \) that produced \( P' \), and then run in succession \( R \) and \( U(R) \) (Footnote 6). This composition of programs is itself a program with complexity \( K(P') + O(1) \), which would be smaller than the assumed minimal program \( Q \), i.e., a contradiction. Therefore, our definition can be changed to compare the obfuscated program not with any simpler program but with the simplest program computing the same function.

Definition 12

Consider a program \({P'}\). We say \({P'}\) is \({\epsilon }\)-securely obfuscated if

$$\begin{aligned} I_K([\![P' ]\!];P') \le \epsilon K([\![P' ]\!]), \end{aligned}$$

for \( 0 \le \epsilon \le 1 \).

We say \({P'}\) is a securely obfuscated program if it is \({\epsilon }\)-secure and \({\epsilon }\) is negligible.

We believe our definitions of obfuscation differ from the simulation black-box model in important ways, allowing us to avoid the impossibility result of [1].

Our definition is a less stringent form of obfuscation rather than a weak form of black box obfuscation. We assume the functionality of an obfuscated program is almost completely known and available to an adversary, and only require hiding the implementation rather than the functionality itself. This approach to obfuscation is very practical and pragmatic, especially for software protection obfuscation, as usually the customer likes to know exactly what a product does, although s/he might not care about how it was implemented.

Our definition of security takes a program \( P \) (clear code), which supposedly is an easy and smart implementation of functionality \( F \), and compares it with \( P' \), which is a different and supposedly unintelligible implementation of \( F \), such that the original \( P \) cannot be perceived or obtained from \( P' \). The defender’s aim is not to prevent an adversary from understanding or finding \( F \), but to prevent her/him from finding the particular implementation \( P \).

This intuition best matches the idea of best possible obfuscation, which was advanced by Goldwasser and Rothblum [7]. According to [7], an obfuscator \( \mathcal O \) is considered best possible if it transforms a program \( P \) so that any property that can be computed from the obfuscated program \( P' \) can also be computed from any equivalent program of the same functionality. However, despite the close intuitive correspondence, our definition also differs from best possible obfuscation, in the sense that the latter still relies on some form of black-box simulation. It was proved in [7] that best possible obfuscation has a strong relation to indistinguishability obfuscation [2, 5] if \( \mathcal O \) is efficient, i.e., runs in polynomial time.

In contrast, the black-box model definition requires that all properties of a given obfuscated program \( P \) must be hidden from any adversary that does not know its source code but has access to at most a polynomial number of points of \( [\![P ]\!] \).

  1. We can see that in our case, the adversary knows more about the functionality. Since the functionality is mostly public, this would be equivalent to giving the simulator in the black-box model access to this extra knowledge, reducing the advantage of the adversary and possibly making some functions obfuscatable.

  2. On the other hand, our definition allows the leakage of a small, but non-zero, amount of information. Compare this with the black-box model, where a single-bit property that is non-trivially computed by an adversary renders a function un-obfuscatable. Our definition requires the adversary to be able to compute more than a few bits in order for the obfuscation to be broken.

  3. Our definition considers only deterministic adversaries, again making adversaries less powerful and reducing their advantage.

We can try to model the implications for our definition of a successful black-box adversary against obfuscation. Consider an adversary \( \mathcal A \) attacking a predicate \( \pi \) by accessing an algorithm \( A \), such that \( A(P') = 1 \) if and only if \( P' \) satisfies \( \pi \). In this case, \( \mathcal A \) is able to compute 1 bit of information about \( P' \), and we want to measure how much this helps \( \mathcal A \) in describing some simpler program \( P \) that implements the same function \( [\![P' ]\!] \). Since the adversary \( \mathcal A \) knows \( A \), s/he can enumerate the set \( S \) of all programs that satisfy \( A \). Then, for some program \( P \in S \), we would have \( K(P|A) \le K(S|A) + \log |S| \).

Note that the set \( S \) may be infinite, and so enumeration is the best that can be done: we enumerate all relevant programs and run \( A \) with each of them, noting the result. We could try to avoid this infinity problem by noticing we are only interested in programs simpler than the original, and thus satisfying \( K(P) \le K(P') \le |P'| \). This does not give absolute guarantees, since in general there are programs with \( |P| > |P'| \) and \( K(P) \le K(P') \), but our hope is that these become few and far between as they increase in length. Thus, even if we make this assumption and disregard a whole class of possibly relevant programs, we still have in the order of \( 2^n \) specimens. If \( A \) were a deterministic algorithm, we would have \( K(S|A) = O(1) \), and if \( S \) contained fewer than half of all possible programs, then indeed we would find that \( K(P|A) \le K(S|A) + \log |S| \le n-1. \)

However, \( A \) is randomized, and in order to accurately produce \( S \), for each program \( q \), \( A \) must be run with a set of random coins \( r \) such that \( A(q,r) \Leftrightarrow q \in S \). One way to describe \( S \) from \( A \) would need a polynomial number of bits for each program in \( S \). Now, we no longer have the comfort of a negligible \( K(S|A) \) term, and we can no longer be sure that knowing this property would in any way reduce the complexity of our target program.

We can still try to go around this problem, by allowing our enumerator to list not only all programs but also all possible random strings, and choosing the majority vote for each program. The length, and therefore the number, of possible random strings is bound by the running time of the program, which in turn is bound by a function of its length. Therefore, if we limit the length of our programs we limit the number of random strings to search.

This would eliminate the necessity of considering the extra information due to the random coins, but on the other hand the running time would increase exponentially. Any successful adversary would be very different to the PPT adversaries of [2], and although our definition has been made, for ease of exposition, with unbounded Kolmogorov complexity (for unbounded adversaries), it is easy to change it to consider polynomially-bounded adversaries by using an alternative definition of mutual information:

$$\begin{aligned} I_K^*(P ; P') = K(P) - K^{t()}(P |P'), \end{aligned}$$

where \( t() \) is a polynomial on the length of \( P' \) that acts as a hard limit to the number of steps the universal machine can be run for. This notion of information can be negative in some cases, but clearly limits the ability of any adversary trying to find \( P \) from \( P' \) in a consistent way with the black-box model. With this definition of obfuscation, the above reasoning would lead to examples where non-obfuscatability by the black-box model does not prevent obfuscatability in the algorithmic information-theoretic model.

4.4 Security and Unintelligibility

Our first attempt to characterize obfuscation was based on unintelligibility, and then we evolved to a notion of security based on mutual information. The first notion seemed more immediately intuitive: traditional obfuscation techniques seem to rely only on making the code as complex as possible in an attempt to hide its meaning. But precisely this notion of “hiding meaning” is nothing more than reducing the information that the obfuscated program leaks about the original one, and so we believe the second approach to be the correct one. However, we can ask the natural question: is there any relation between these two concepts? Does high unintelligibility imply high security, or vice versa?

We give a partial answer to this question. In certain situations, high unintelligibility will imply high security, as stated in the following theorem.

Fig. 2. An example of obfuscating a program using the renaming technique.

Theorem 5

Consider a program \( P \) of length \( m \) and its obfuscated version \( P'=\mathcal O(P,\lambda ) \) of length \( n > m \) (where n is at most polynomially larger than \( m \)), satisfying \( c \)-unintelligibility for \( c \ge \frac{n}{K(P)} \). Assuming that the obfuscation security parameter \( \lambda \) satisfies \( K(P'|P) \ge K(\lambda | P) - \alpha \), then up to \( O(\log n) \):

$$\begin{aligned} I(P;\mathcal O(P,\lambda )) \le \alpha - O(1) \end{aligned}$$

Proof

By Theorem 3, \( K(P') = K(\mathcal O(P,\lambda )) \le K(P,\lambda ) + K(\mathcal O) = K(P,\lambda ) \) (Footnote 7), and by assumption \( K(P') \ge (c+1) K(P) \). Then,

$$\begin{aligned} K(P,\lambda )&\ge (c+1) K(P) \\ K(P,\lambda ) - K(P)&\ge c K(P) \\ K(\lambda | P)&\ge c K(P) \end{aligned}$$

Now, for mutual information, we have

$$\begin{aligned} I(P;P')&= K(P') - K(P'|P) \\&\le K(P') - K(\lambda | P) + \alpha \ \text {(by assumption)} \\&\le n - c K(P) + \alpha \\&\le n - n + \alpha \ \text {(by assumption)} \\&= \alpha \end{aligned}$$

Intuitively, if we consider an optimal obfuscation key (it has all the information needed to produce \( P' \) from \( P \), but not much more than that), we can say that if \( P' \) is c-unintelligible for large enough \( c \), then \( P' \) is a secure obfuscation of \( P \).

The above theorem shows when high \( c \)-unintelligibility implies security of code obfuscation. It turns out that the reverse implication does not hold, as the following claim illustrates.

Claim

There are obfuscated functions \({\mathcal O(P,\lambda )}\) that are arbitrarily secure and yet do not satisfy c-unintelligibility for any \(c \ge 0\).

Proof

Consider first the case of the program \( P \) of Fig. 2, which simply checks a password for access, and its obfuscated version \( P' \), which was computed using layout obfuscation [3]: variable and function renaming and comment deletion. The obfuscated variable and function names are independent of the original ones, and so the information that \( P' \) contains about \( P \) is limited to the unchanging structure of the code: assignment, test and if branch.

The complexity of the original code can be broken into several independent parts: the comment, the structure, and the variable and function names; therefore we can write \( K(P) = n_c + n_s + n_v \). The only thing that \( P' \) can give information about is the structure part, since all the other information was irrevocably destroyed in the process: there is no data remaining that bears any relation to the lost comment or the lost function names. Therefore, \( I_K(P;P') \le n_s = \frac{n_s}{n_c + n_s + n_v}K(P) \). We can make the fraction \( \frac{n_s}{n_c + n_s + n_v} \) as small as necessary by inserting large comments, using long names and representing structure as compactly as possible, for example, keeping names in a dictionary block and indicating their use by pointers to it.

However, the complexity of \( P' \) is less than that of \( P \), since there is less essential information to describe: the same structure, no comments, and function names that can be described by a simple algorithm. Therefore, c-unintelligibility cannot be satisfied for any non-negative value of \( c \). This shows that high \( \epsilon \)-security does not imply high unintelligibility.
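The argument above can be made tangible with a toy sketch of layout obfuscation, with compressed length standing in for Kolmogorov complexity. The function and the renaming table are our own illustrative choices, not the code of Fig. 2.

```python
import re
import zlib

def layout_obfuscate(src: str) -> str:
    """Layout obfuscation: delete comments and replace identifiers with
    short, meaningless names (a toy version of renaming obfuscation)."""
    src = re.sub(r"#.*", "", src)                    # comment deletion
    names = {"check_password": "f1", "password": "a", "stored": "b"}
    for old, new in names.items():
        src = re.sub(rf"\b{old}\b", new, src)        # renaming
    return src

original = (
    "# Compares the user-supplied password against the stored secret\n"
    "# and grants access on a match. Maintained by the auth team.\n"
    "def check_password(password, stored):\n"
    "    return password == stored\n"
)
obfuscated = layout_obfuscate(original)

# The obfuscated version has *less* essential information to describe,
# so its compressed size (our stand-in for K) is smaller:
print(len(zlib.compress(original.encode())), len(zlib.compress(obfuscated.encode())))
```

This mirrors the claim: the transformation destroys the comment and name information, so the result leaks little about the original yet has strictly lower complexity.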

The next theorem shows how we can obtain security if the obfuscated code is complex enough and the obfuscation key is independent of the original program.

Theorem 6

Let \( P \) be a program of length \( n \) and \( \lambda \) an independent and random obfuscation key, satisfying \( K(\lambda ) \ge n - \alpha \) and \( K(P,\lambda ) \ge K(P) + K(\lambda ) -\beta \), where \( \alpha ,\beta \in \mathbb N \). Suppose the obfuscation \( \mathcal O(P,\lambda ) \) satisfies \( K(\mathcal O(P,\lambda )|P) \ge K(\lambda |P) - O(1) \). Then up to a logarithmic factor: \( I_K(P;\mathcal O(P,\lambda )) \le \alpha + \beta \).

Proof

$$\begin{aligned} I_K(P; \mathcal O(P,\lambda ))&= K(\mathcal O(P,\lambda )) - K(\mathcal O(P,\lambda )|P) \\&\le K(\mathcal O(P,\lambda )) - K(\lambda |P) \\&\le K(\mathcal O(P,\lambda )) - K(\lambda ,P) +K(P) \ \ \text {(By Theorem}\,\text {1)} \\&\le n - K(P) - n + \alpha + \beta + K(P) \le \alpha + \beta \end{aligned}$$

The first two assumptions are natural: picking \( \lambda \) at random and independently of \( P \) will satisfy high complexity and low mutual information with high probability, so we can simply assume those properties at the start. The third assumption, however, is not immediately obvious: it describes a situation where the obfuscation key is optimal, in the sense that it contains just the amount of information needed to go from \( P \) to \( P' \), within a small logarithmic term. The following lemma shows that \( \lambda \) must have a minimum complexity to allow the derivation of \( P' \) from \( P \).

Lemma 3

Let \( P' = \mathcal O(P,\lambda ) \) be the result of obfuscating a program \( P \). Then \( K(\lambda |P) \ge K(P'|P) - O(1) \).

Proof

Given the obfuscator \( \mathcal O \) and any \( \lambda \), construct the function \( q_\lambda (\cdot ) = \mathcal O(\cdot , \lambda ) \). Let \( Q_\lambda \) be a program that implements it. Then, clearly, \( U(Q_\lambda , P) = P' \) and so \( |Q_\lambda | \ge K(P' | P) \). To describe \( Q_\lambda \), we only need to specify \( \lambda \) and instructions to invoke \( \mathcal O \) with the proper arguments, but since \( P \) is already known, we can use it to find a shorter description for \( \lambda \). This gives \( K(P'|P) \le K(\lambda |P) + O(1) \).

An optimal obfuscation key \( \lambda \) is then one that uses as little information as possible. Obfuscation techniques may or may not use randomness. In the proof of the claim above, we showed one case where names could be obfuscated in a deterministic way, without any randomness. However, we could equally have used highly random names instead, with the same effect on security but increasing unintelligibility as well. We can also consider that the set of obfuscation techniques is finite and describe each of them by a unique number. This way, we can characterize a single application of obfuscation by a key composed of the technique’s index and the randomness needed.
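The last remark can be sketched as a dispatch over a finite, numbered catalogue of techniques, with the key packaged as (technique index, random bits). The two techniques shown are hypothetical placeholders of our own, not ones defined in the paper.

```python
import random

# A finite, numbered catalogue of (toy) obfuscating transformations.
TECHNIQUES = {
    0: lambda src, r: src.replace("x", f"v{r % 100}"),          # renaming
    1: lambda src, r: f"if ({r}*{r} + {r}) % 2 == 0: {src}",    # opaque guard
}

def obfuscate(src, key):
    """Apply the technique selected by the key's index,
    parameterized by the key's random bits."""
    index, randomness = key
    return TECHNIQUES[index](src, randomness)

key = (1, random.randrange(2**16))   # key = technique index + randomness
print(obfuscate("x = x + i", key))
```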

We now proceed to show that it is possible to achieve obfuscation security according to our definitions, but restricted to a passive adversary, that is, one that does not perform transformations on the intercepted code. The intuition is to use obfuscation techniques that behave as much as possible like secure encryption functions, namely, using random keys that are independent of the code and large enough that they obscure almost all original information. The crucial difference that enables security is that, because an obfuscation technique preserves functionality, we do not need to decrypt the obfuscated code and so do not need to hide the key.

First, we prove the effect of obfuscating an elementary piece of code, by application of a single obfuscation technique. Then, we reason about the case of a full program, composed of several of these independent blocks.

Theorem 7

Let \( p \) represent a program block, \( \mathcal O \) an obfuscation technique and \( \lambda \in \mathcal L \) an obfuscation key with fixed length \( n \). Let \( p' = \mathcal O(p, \lambda ) \) be the obfuscated block. Assume \( \mathcal O \) produces an output with length \( \ell \le n + \gamma \) and is “nearly-injective” in the following sense: for every \( p \), any subset of \( \mathcal L \) of keys with the same behaviour for \( p \) has cardinality at most polynomial in \( n \). That is, for all \( p \) and \( \lambda _0 \in \mathcal L \), \( \left| \{\lambda : \mathcal O(p,\lambda ) = \mathcal O(p,\lambda _0) \} \right| \le n^k \), for some positive integer \( k \).

Then, if the key is random, \( K(\lambda ) \ge n - \alpha \) and independent from \( p \), \( K(\lambda |p) \ge K(\lambda ) - \beta \), the obfuscated code \( p' \) is \( (\alpha + \beta + \gamma ) \)-secure up to a logarithmic term.

Proof

By symmetry of information, we can write \( K(p|p') = K(p'|p) + K(p) - K(p') \). Since \( p' = \mathcal O(p,\lambda ) \), it is easy to see that \( K(p'|p) \le K(\mathcal O) + K(\lambda |p) = K(\lambda |p) \), since \( K(\mathcal O) \) is a constant independent of \( p \), \( \lambda \) or \( p' \). We can also show the reverse inequality. To produce \( \lambda \) from \( p \), we can first produce \( p' \) from \( p \) (with a program that takes at least \( K(p'|p) \) bits) and then build the set \( S_{p,p'} = \{ \lambda : |\lambda | \le n, \mathcal O(p,\lambda ) = p'\} \) of all compatible \( \lambda \), by a program whose length is \( O(1) \). Finally, we just have to give the index of \( \lambda \) in this set, and so \( K(\lambda |p) \le K(p' | p) + \log \#S_{p,p'} \). Then, by the assumptions on \( \lambda \), \( K(p'|p) \ge K(\lambda |p) - \log \#S_{p,p'} \ge n - \alpha - \beta - \log \#S_{p,p'} \). By the assumption on the output of \( \mathcal O \), \( K(p') \le |\mathcal O(p,\lambda )| \le n + \gamma \). This gives \( K(p|p') \ge K(p) + n - \alpha - \beta - O(\log n) - K(p') \ge K(p) - \alpha - \beta - \gamma - O(\log n) \).

The randomness and independence conditions for the keys are natural. The other two conditions may seem harder to justify, but they ensure that \( \mathcal O \) effectively mixes \( p \) with the randomness provided by \( \lambda \): the limit on the size of subsets of \( \mathcal L \) implies a lower bound on the number of possible obfuscations for \( p \) (the more the better); on the other hand, the limit on the length of the output of \( \mathcal O \) forces the information contained in \( p \) to be scrambled with \( \lambda \), since it can take only a few more bits than those required to describe \( \lambda \) itself. The extreme case is similar to One-Time Pad encryption (Footnote 8), when both the output and \( \lambda \) have the same size \( n \), and \( \mathcal O \) is injective for each \( p \): there are \( 2^n \) keys, as well as possible obfuscations for each \( p \). Furthermore, because the obfuscated code has the same length as \( \lambda \), the exact same obfuscated strings are possible for each \( p \), maximizing security.
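The One-Time-Pad analogy can be made concrete. Note that this is only an illustration of the information-theoretic extreme (XOR with an equally long random key), not a functionality-preserving obfuscator:

```python
import secrets

def otp_mix(p: bytes, lam: bytes) -> bytes:
    """XOR a block with an equally long random key: every output string is
    equally likely, so p' alone carries no information about p."""
    assert len(p) == len(lam)
    return bytes(a ^ b for a, b in zip(p, lam))

p = b"x = x + i"
lam = secrets.token_bytes(len(p))   # random key, independent of p
p_prime = otp_mix(p, lam)
assert otp_mix(p_prime, lam) == p   # with the key, p is fully recoverable
```

As in the theorem, the output has the same length as λ, and for each p every obfuscated string is possible, so an adversary without λ gains nothing from p′.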

In general, a program is composed of several of these minimal blocks in sequence. The above proof shows when an obfuscated block \( p' \) gives no information about its original block, say \( p_0 \). Nor can \( p' \) give any more information about any other block \( p_1 \), as there is no causal relation between \( p_1 \) and \( p' \). At best, there is some information in \( p_1 \) about \( p_0 \), but then the information given by \( p' \) about \( p_1 \) is at most that given about \( p_0 \). Therefore, we conclude that if all the sub-blocks in a program are securely obfuscated, then the whole program is. The above theorem, then, shows that secure obfuscation is possible under very reasonable assumptions.

4.5 Normalized Kolmogorov Complexity

Kolmogorov complexity is an absolute measure, which is problematic when we want to compare two programs of different sizes. For example, consider a program \( P \) of size 1000 bits that can be compressed to 500 bits, and another program \( Q \) of size \( 10^6 \) bits that compresses to 1000 bits. By the absolute measure of Kolmogorov complexity, \( Q \) is more complex than \( P \). However, \( P \) can only be compressed to about half of its size, whereas \( Q \) can be compressed to \( \frac{1}{1000} \) of its size, which clearly indicates that \( Q \) has more regularities than \( P \), and hence makes \( P \) relatively more complex than \( Q \). To overcome this issue, we suggest a normalized version of Kolmogorov complexity, relativized by the upper bound of Kolmogorov complexity, i.e., the maximum complexity a given obfuscated code can achieve. Kolmogorov complexity is upper-bounded by the length of the program being measured, according to Theorem 2; this bound can be used as the maximum Kolmogorov complexity. Normalized Kolmogorov complexity is useful for demonstrating how far the complexity of obfuscated code, before and after a given attack, diverges from its maximum value in terms of information content (variability of text content).

Definition 13

The normalised Kolmogorov complexity \( NK \) of a program \( P \) is given by:

$$\begin{aligned} NK(P) = \frac{K(P)}{|P| + 2\log (|P|)} \end{aligned}$$

where \( |P| \) is the length of \( P \).

A high value of \( NK \) means that there is high variability in the program's content structure, i.e., high complexity. A low value of \( NK \) means high redundancy, i.e., a high ratio of repeating fragments, operators and operands in the code. Since Kolmogorov complexity can be effectively approximated by compression [9], it is possible to estimate \( NK \) by using a standard compressor instead of a minimal program:

$$\begin{aligned} NC(P) = \frac{C(P)}{|P| + 2\log (|P|)}, \end{aligned}$$

implying \( 0 \le NC \le 1 \).
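The estimate \( NC \) can be computed directly with any standard lossless compressor; a minimal sketch using zlib (the inputs are our own illustrative examples):

```python
import math
import random
import zlib

def NC(data: bytes) -> float:
    """Normalized compression-based complexity: C(P) / (|P| + 2 log |P|)."""
    return len(zlib.compress(data, 9)) / (len(data) + 2 * math.log2(len(data)))

random.seed(0)
noisy = bytes(random.randrange(256) for _ in range(4096))  # high variability
regular = b"abcd" * 1024                                   # high redundancy

print(round(NC(noisy), 3), round(NC(regular), 3))
# A value near 1 indicates high variability; a value near 0, high redundancy.
```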

5 Individual Security of Code Obfuscation

Studying the security of individual instances of obfuscated code provides more granularity. Even if an obfuscated program is considered secure according to our definition, it may have parts that provide some information about other obfuscated parts, which reduces the security of the obfuscated code. It could be that a program is obfuscated but some module is not: some part of the obfuscated code remains in its original form. We can demonstrate this relation by providing some bounds on the complexity of subprograms in the same program.

We can view (obfuscated) programs as finite, and therefore recursively enumerable, sets of subprograms (blocks or modules), \( P'=\{p'_n\}_{n \in \mathbb N} \). Given an obfuscated program \( P' \), it may consist of both obfuscated and unobfuscated modules; that is, \( \exists p'_j, p'_i \in P' \), where \( p'_i \) is an obfuscated module and \( p'_j \) is an unobfuscated one. Theorem 6 demonstrates the effect of the security parameter \( \lambda \) on the whole obfuscation process; the following results show the effect of each individual security parameter on each obfuscated subprogram. Choosing a good \( \lambda \) (with a good source of randomness) requires that only a minimal amount of information be shared between the clear code and the obfuscated code, so it is important to study the relation between the security parameter and the original (clear) code. In the following theorem, we use a simple way to check the independence between \( \lambda \) and \( P \) at the subprogram level.

Theorem 8

Let \( P' \) be a set of obfuscated subprograms and \( P \) a set of clear subprograms, such that each subprogram \( p'_i \) of \( P' \) has length at most \( n \) and is the obfuscation of a corresponding block in \( P \): \( \exists p_i \in P, \kappa _i \in \lambda \) such that \( p'_i =\mathcal O(p_i,\kappa _i) \). If \( K(\kappa _i,p_i) \ge K(p_i) + K(\kappa _i) - \alpha \), then (up to a logarithmic term): \( I_K(\kappa _i;p_i) \le \alpha \).

Proof

Since \( K(p_i,\kappa _i) \le K(\kappa _i|p_i) + K(p_i) + O(\log n) \), we have by assumption

$$\begin{aligned} K(p_i) + K(\kappa _i) - \alpha&\le K(\kappa _i|p_i) + K(p_i) + O(\log n) \\ K(\kappa _i) - \alpha&\le K(\kappa _i|p_i) + O(\log n) \\ K(\kappa _i) - K(\kappa _i|p_i)&\le \alpha + O(\log n) \\ I_K(\kappa _i;p_i)&\le \alpha + O(\log n) \end{aligned}$$

The following two theorems address the security of two different forms of code obfuscation: obfuscation-as-encoding and obfuscation-as-hiding. In the obfuscation-as-encoding technique, the original program is transformed in such a way that the structure of the original code changes while the functionality is preserved; examples are data transformation techniques such as array splitting, variable splitting, array restructuring and scalar-variable merging [3]. The encoding process is considered the security parameter that dictates how the obfuscation should be performed and where it should take place. Encoding differs from encryption: if somebody knows the encoding process, then the original code can be recovered. In encoding obfuscation, the clear program is not present in the obfuscated code; what still exists, hidden, is the encoding process. Reversing the encoded (obfuscated) program requires finding and understanding the encoding process. For instance, we used a simple encoding in Fig. 1: x=x+i is encoded as x=x+4*i;x=x-2*i;x=x-i. The encoding process converts i into 4*i;-2*i;-i; to figure out x=x+i, we have to find and combine 4*i;-2*i;-i, which is the security parameter in this case. The next theorem addresses the security of obfuscated code when we apply encoding to \( P \) (as a set of subprograms); here, the encoding process is represented as part of the security parameter.
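The x=x+i encoding from Fig. 1 can be checked mechanically: the three-step update is functionally identical to the clear one, and recovering the clear form means finding and recombining the three terms. A minimal sketch:

```python
def clear_step(x: int, i: int) -> int:
    return x + i                      # the clear statement: x = x + i

def encoded_step(x: int, i: int) -> int:
    # Data-encoding obfuscation: i is split into 4*i, -2*i, -i,
    # which sum back to i (4 - 2 - 1 = 1).
    x = x + 4 * i
    x = x - 2 * i
    x = x - i
    return x

# Functionality is preserved for every input:
assert all(clear_step(x, i) == encoded_step(x, i)
           for x in range(-50, 50) for i in range(-50, 50))
```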

Theorem 9

(Encoding). Let \( P' \) be a collection of obfuscated subprograms \( p_i' \) using \( \kappa _i \in \lambda \), each of length at most \( n \). Then, for \( \delta _{\kappa _i}=\delta (\kappa _i|P') \): \( I_K(\kappa _i;p_i') \le \delta _{\kappa _i} - O(1). \)

Proof

Since \( P' \) is a collection of subprograms, we can assume that it contains all the information in \( p'_i \) as well as in all the other subprograms. Then,

\( K(\kappa _i|P') = K(\kappa _i|p'_1, \ldots , p'_i, \ldots , p'_n) \le K(\kappa _i|p'_i) \) and so

$$\begin{aligned} I_K(\kappa _i;p_i')&= K(\kappa _i) - K(\kappa _i|p'_i) \\&\le K(\kappa _i) - K(\kappa _i|P') \\&\le K(\kappa _i) - (\log \#P' - \delta _{\kappa _i}) \ \text {by Definition 4} \\ \end{aligned}$$

Because each subprogram has length at most \( n \), \( P' \) can contain at most \( 2^n \) distinct programs. Assuming that \( P' \) contains a constant fraction of these, each appearing at most a constant number of times, we have \( \#P' = \Theta (2^n) \) and \( \log \#P' = n + O(1) \). Then,

$$\begin{aligned} I_K(\kappa _i;p_i')&\le n - n + \delta _{\kappa _i}- O(1) \\&\le \delta _{\kappa _i} - O(1) \end{aligned}$$

In hiding obfuscation techniques, the original subprogram still exists in the obfuscated program (the set of obfuscated subprograms); the security of such techniques depends on the degree of hiding within the set of obfuscated subprograms. Examples of such techniques are control-flow obfuscations such as Insert Dead Basic Blocks, Loop Unrolling, Opaque Predicates and Flatten Control Flow [3]. These techniques are normally combined with encoding obfuscation techniques in order to make the code more resilient to reverse engineering.

For code obfuscation, an opaque predicate is used as a guard predicate whose value cannot be computed statically, i.e., without running the code; the original code still exists in the obfuscated code, but protected by the predicate. In Fig. 1 we used opaque predicates together with a simple data encoding technique. Consider the following obfuscated code of Fig. 1-(a), where the encoding has been removed. Clearly, x=x+i is still in the code, but hidden under the protection of an opaque predicate.

(Listing a: the obfuscated code of Fig. 1-(a) with the data encoding removed.)

Theorem 10

(Hiding). Let \( P' \) be a collection of obfuscated subprograms \( p_i' \), each of length at most \( n \). Then, for \( \delta _{p_i}=\delta (p_i|P') \): \( I_K(p_i;p_i') \le \delta _{p_i} - O(1). \)

Proof

The proof is very similar to that of Theorem 9. The subprogram \( p_i \) is hidden in \( P' \) in its original form by the obfuscation process. Since \( P' \) is a collection of subprograms, we can assume that it contains all the information in \( p'_i \) as well as in all the other subprograms. Then, \( K(p_i|P') = K(p_i|p'_1, \ldots , p'_i, \ldots , p'_n) \le K(p_i|p'_i) \)

$$\begin{aligned} I_K(p_i;p_i')&= K(p_i) - K(p_i|p'_i) \\&\le K(p_i) - K(p_i|P') \\&\le K(p_i) - (\log \#P' - \delta _{p_i}) \ \text {by Definition 4} \\ \end{aligned}$$

As in the proof of Theorem 9, \( \#P' = O(2^n) \) and \( \log \#P' = n + O(1) \). Then,

$$\begin{aligned} I_K(p_i;p_i')&\le n - n + \delta _{p_i}- O(1) \\&\le \delta _{p_i} - O(1) \end{aligned}$$

6 Conclusion and Future Work

In this paper, we provided a theoretical investigation of code obfuscation. We defined code obfuscation using Kolmogorov complexity and algorithmic mutual information. Our definition allows a small amount of secret information to be revealed to an adversary, and it gives an intuitive guarantee about the security conditions that have to be met for secure obfuscation. We argued that our definition is more lenient than the virtual black-box model of Barak et al., and for that reason the impossibility result does not apply. In contrast, we showed that under reasonable circumstances we can have secure obfuscation according to our definition.

To the best of our knowledge, this paper is the first to propose algorithmic information theory as a theoretical basis for code obfuscation and its threat model. We believe that our new definition of code obfuscation provides a first step toward establishing quantitative metrics for certifying code obfuscation techniques. Currently, we are working toward deriving new metrics based on our model, aiming to validate and apply these metrics empirically to real programs obfuscated with state-of-the-art obfuscation techniques. There are still some questions we want to address in future work. For example, it is not yet clear whether the complexity of the security parameter (key) always has a positive effect on the security of obfuscated programs under the algorithmic mutual information definition. Furthermore, algorithmic mutual information has a counterpart in classical information theory, namely Shannon mutual information; it would be interesting to explore the relation between our definition and Shannon mutual information in the context of code obfuscation security. We are also planning to study and characterize the security of particular techniques and to analyze more carefully the scenario of active adversaries.