
1 Introduction

Analyzing and characterizing program behavior are important and challenging problems. In general, a new version of a program is created by adding functionality to the original, vanilla version, while the core behavior of the program remains the same as in that vanilla version. Analyzing this core behavior has extensive applications in program classification scenarios, such as malware analysis and signature generation. In malware, changes are made precisely to hide the program from anti-virus signature databases. It is therefore important to develop approaches that can characterize the core behavior of programs that have evolved from an original version.

Although program behavior can be learned by executing the program, or by emulating it in a sandbox and collecting traces, this is an expensive approach and there is no guarantee that the entire program behavior is captured within a given sample execution. We focus on static analysis-based techniques, as they are lightweight and fast compared with run-time approaches such as sandboxing and dynamic program tracing. In the past, system calls have been used for modeling program behavior [15] and for detecting malware. Other approaches, such as [6], used the frequency distribution of critical API calls to detect the presence of malware behavior in a program. However, these approaches have high false positive rates, as frequency distributions can be common across programs that are dissimilar in behavior.

To extract behavioral aspects, API call sequences have been shown to be quite useful [730]. For example, a particular calling sequence is typically constant across all programs evolved from the same base program. However, extracting such a sequence can be cumbersome, as a forward analysis on the control-flow graph can yield an exponential number of such sequences. Furthermore, even if the program analyst extracts all such sequences, it is non-trivial to identify which of them characterize the core program behavior.

To address this problem, we use the notion of critical API calls, the most frequently occurring API calls in the original version of the software. We focus only on the sequences generated by the critical API calls, and thereby reduce the computational effort for the program analyst. Given this insight, our approach uses a backward tracing algorithm that, for each critical API call, extracts the sequence of instructions which have directly or indirectly modified the call's arguments. For a given program, we take two versions from the same class and generate all the backward traces. Next, we use a statistical profiling technique that compares the lengths of the sequences and stores the result of the comparison. This serves as the signature for the class of programs, and a statistical threshold is used to check whether any other program exhibits similar behavior. We have tested our technique on two Windows executables, notepad.exe and mstsc.exe, by introducing obfuscations and adding additional call sequences.

In this paper, we present a static analysis approach which extracts the behaviors of a program as ordered sequences of critical API calls. Our contributions are as follows:

  • Capturing Program Behavior using Backward Traces. For each critical API call in the program binary, a backward trace is the sequence of all other critical API calls that affect the arguments of the concerned API call, resulting in an accurate model of program behavior. We describe an algorithm that performs this tracing.

  • Program Similarity Testing. We have experimented by obfuscating two normal Windows application programs, adding redundant functionality, and determining their similarity to the original programs. Our preliminary experiments show that a vanilla version has to be obfuscated by introducing at least 20 % new functionality to evade detection. These results show that our approach can effectively classify programs based on their behavior.

The paper is organized as follows. In Sect. 2, we review related work on program behavior analysis and malware detection. In Sect. 3, we describe our backward tracing approach for behavioral tracing. In Sect. 4, we describe a prototype implementation of our approach, present experimental results, and evaluate its effectiveness. Finally, in Sect. 5, we conclude and discuss future work.

2 Related Work

Much of the related work in this domain comes from the area of malware detection; we cover some important results. Cohen [13] and Chess and White [14] use sandboxing to detect viruses and proved that, in general, the problem of virus detection is undecidable. Our work is closely related to that of Bergeron et al. [1], who used slicing on the assembly code of a program to extract code fragments that are critical from a security standpoint. In [8], the authors use static analysis to generate a critical API graph, which they compare against a security policy to detect the presence of malicious code in a program. In [10], Christodorescu and Jha use static analysis to detect malicious code in executables. Their implementation, called SAFE, handles simple obfuscations used by malware writers to evade detection, such as the insertion of NOPs between instructions. Bergeron et al. [8] consider critical API calls and security policies to test for the presence of malicious code; their approach does not work for obfuscated malicious executables. Bilar [15] uses statistical structures, such as opcode frequency distributions and graph structure fingerprints, to detect malicious programs.

In [16], Christodorescu et al. exploit dynamically captured semantic heuristics to detect obfuscated malware. DOME [17] uses static analysis to identify system call locations and run-time monitoring to check that all system calls are made from a location identified during static analysis. Min Sun et al. [3] use dynamic monitoring to detect worms and other exploits. Sai et al. [6] use the frequency distribution of API calls in a program to differentiate between malware and benign programs.

A more closely related work is [18], where the authors attempt to capture behavioral similarities across variants of the same malware family. They use execution traces of a malware sample to create a malspec (malicious specification) for the sample, which is then able to detect other variants of that sample. It is a dynamic analysis technique and uses subgraph matching to compare two malspecs, which increases detection time considerably. Similar approaches have recently been proposed in [5, 7, 12]. In [11], the authors use data flow graphs to quantify the behavior of malware and thereby detect it effectively. However, their approach is specific to malware detection and cannot be applied to program similarity estimation.

3 Our Approach

In this section, we first outline our approach to behavioral tracing. We call the behavioral trace of a program its signature. Next, we describe in detail the program behavior model used for signature generation and the statistical comparison technique. Our model generates a signature for the vanilla (earliest) version of the program and calculates the similarity of this signature with future versions of the program.

The behavior of a program can be specified by the API calls that the program and its versions use. For instance, a virus searching for executable files will typically make use of API calls such as FindFirstFileA, FindNextFileA, and FindClose in KERNEL32.DLL; the behavior of searching for files is captured by the use of these API calls. Rather than considering all API calls, we consider only critical API calls [8, 18]. The critical API calls are the calls essential for the core functionality of the program and will be required across all future versions. We extract the sequences of critical API calls from the program and represent them compactly as signatures. Next, we give details of each step of our approach.
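As an illustration only (the selection rule beyond "most frequently occurring", the call_sites input, and the top_k parameter below are assumptions made for this sketch, not part of our implementation), selecting critical API calls by frequency could look as follows:

```python
from collections import Counter

# Hypothetical input: API call sites recovered from the disassembly,
# e.g. [("FindFirstFileA", 0x401000), ("FindNextFileA", 0x401040), ...]
def critical_api_calls(call_sites, top_k=10):
    """Return the top_k most frequently occurring API names as the critical set."""
    counts = Counter(name for name, _addr in call_sites)
    return [name for name, _freq in counts.most_common(top_k)]

calls = [("FindFirstFileA", 0x401000), ("FindNextFileA", 0x401040),
         ("FindNextFileA", 0x401080), ("FindClose", 0x4010c0)]
print(critical_api_calls(calls, top_k=2))  # ['FindNextFileA', 'FindFirstFileA']
```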

3.1 Disassembling the Program and Converting into IR

We use the IDA Pro disassembler to disassemble the program binary. Next, we convert the disassembled instructions into an intermediate representation (IR) in the form of C-like statements. If these steps were carried out on a parallel processor, multiple programs could be processed in parallel and made ready for analysis, so the overhead of these steps is negligible. In addition, these steps can be performed offline on a cloud cluster or a high-end server.
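The exact form of the C-like IR is not prescribed here, so the following is only a hypothetical sketch of lifting a couple of x86 instructions into simple dest/op/args statements (IRStatement and lift are invented names, not our actual lifter):

```python
from dataclasses import dataclass

@dataclass
class IRStatement:
    """A C-like intermediate statement of the form dest = op(args)."""
    dest: str
    op: str
    args: tuple

def lift(disasm_line):
    """Toy lifter for two x86 forms; a real lifter covers the full instruction set."""
    mnemonic, operands = disasm_line.split(maxsplit=1)
    ops = [o.strip() for o in operands.split(",")]
    if mnemonic == "mov":   # mov eax, ebx  ->  eax = ebx
        return IRStatement(ops[0], "assign", (ops[1],))
    if mnemonic == "add":   # add eax, 4    ->  eax = eax + 4
        return IRStatement(ops[0], "+", (ops[0], ops[1]))
    raise NotImplementedError(mnemonic)

print(lift("mov eax, ebx"))  # IRStatement(dest='eax', op='assign', args=('ebx',))
```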

3.2 Extracting the Behavior

Each program behavior can be specified by an ordered sequence of API calls, so the signature of a program is based on its sequences of critical API calls. Let the set \(P=(a_1,a_2,\ldots ,a_n)\) be a profile created from a program by extracting its critical API calls, where \(a_i\) represents the ith critical API call and n is the total number of critical API calls.

To get the sequence of statements for each critical API call, we compute the backward trace of the program with respect to the set P. A trace is a subset of the program that contains only those statements which evaluate the arguments passed to the tracing criterion; in our case, the tracing criterion is the set of critical API calls P. We need not generate a trace for every critical API call in P: if the trace generated for criterion \(a_i\) contains some API call \(a_j \in P\), then we need not trace \(a_j\), since the trace for \(a_i\) already contains all the statements that the trace for \(a_j\) would contain. Therefore, the number of traces, say m, is at most n. The algorithm for generating the behavior traces is shown in Algorithm 1.

(Algorithm 1: Generating the backward behavior traces)
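As a minimal sketch of the idea behind Algorithm 1 (assuming a simplified, linear list of IR statements in which each statement records the variables it defines and uses; this representation is an assumption for illustration, not our IR), a backward trace can be computed as follows:

```python
def backward_trace(statements, call_index):
    """Collect the statements that directly or indirectly define the arguments
    of the critical API call at position call_index.

    statements: list of dicts with keys "defs" (variables written),
                "uses" (variables read) and "api" (API name or None).
    """
    wanted = set(statements[call_index]["uses"])   # start from the call's arguments
    trace = [call_index]
    for i in range(call_index - 1, -1, -1):        # walk backwards over the program
        stmt = statements[i]
        if stmt["defs"] & wanted:                  # defines something we still need
            trace.append(i)
            wanted |= stmt["uses"]                 # its inputs become needed as well
    return list(reversed(trace))

# Tiny example: "buf" flows into the CreateFileA argument through an assignment.
stmts = [
    {"defs": {"buf"},  "uses": set(),    "api": None},           # 0: buf = ...
    {"defs": {"tmp"},  "uses": set(),    "api": None},           # 1: unrelated
    {"defs": {"path"}, "uses": {"buf"},  "api": None},           # 2: path = buf
    {"defs": set(),    "uses": {"path"}, "api": "CreateFileA"},  # 3: CreateFileA(path)
]
print(backward_trace(stmts, 3))  # [0, 2, 3]
```

A trace whose statements already contain another critical call \(a_j\) subsumes the trace for \(a_j\), which is why the number of traces m stays at or below n.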

3.3 Computing the Longest Common Subsequence (LCS)

In this section, we compute the signature of a program by computing the LCS over the backward traces generated in the previous section. The signature is generated by computing the LCS of corresponding traces from two versions of the program. For example, to create a signature for the MyDoom family, we consider MyDoom.a and MyDoom.aa as training samples. The algorithm for generating the signature is shown in Algorithm 2. While we have used malware samples here, the approach applies, without loss of generality, to any program with several versions.

(Algorithm 2: Signature generation using the LCS of backward traces)
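As a companion sketch to Algorithm 2 (the traces shown are made up for illustration and this is not our actual implementation), a standard dynamic-programming LCS over two API-call traces looks as follows:

```python
def lcs(seq_a, seq_b):
    """Longest common subsequence of two API-call traces (dynamic programming)."""
    m, n = len(seq_a), len(seq_b)
    dp = [[[] for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + [seq_a[i - 1]]
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], key=len)
    return dp[m][n]

# Example: the trace of one critical API call from two versions of a program.
t1 = ["CreateFileA", "WriteFile", "CloseHandle"]
t2 = ["CreateFileA", "GetLastError", "WriteFile", "CloseHandle"]
print(lcs(t1, t2))  # ['CreateFileA', 'WriteFile', 'CloseHandle']
```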

3.4 Classification Strategy

We use a statistical profiling technique to differentiate between programs. We measure the difference between the proportions of the critical API calls in a signature and those of a test program using the chi-square test [19], a statistical significance test that measures the difference between proportions in two independent samples. The signature \(SIG_i\) for a program \(M_i\) specifies the LCS of critical API calls that a version of the program belonging to \(M_i\) is expected to have. To test the relationship of a given test file to \(M_i\), its API calls are extracted and compared to those in the signature. The chi-square statistic is then computed as:

$$\begin{aligned} \chi _i^2 = \frac{( O_i - E_i )^2}{E_i}, \qquad 1 \le i \le n. \end{aligned}$$

Here, \(O_i\) is the observed frequency (LCS length) of the ith critical API call in the test file and \(E_i\) is its expected frequency, i.e., its frequency in the signature of the program. Each \(\chi _i^2\) is compared against a threshold value \(\epsilon \) from a standard chi-square distribution table with one degree of freedom. The degrees of freedom correspond to the number of parameters that can vary in the statistical model. A significance level of 0.05 was selected, which means that 95 % of the time we expect \(\chi _i^2\) to be less than or equal to \(\epsilon \). For one degree of freedom and significance level 0.05, \(\epsilon = 3.84\). Let \(U = \lbrace API_i \ | \ \chi _i^2 \le \epsilon \rbrace \). We define a degree of membership \(\lambda \) as

$$\begin{aligned} \lambda = \frac{|U|}{n}. \end{aligned}$$

The degree of membership \(\lambda \) is a measure of the similarity of a test file to the given program. The statistical profiling algorithm is shown in Algorithm 3.

(Algorithm 3: Statistical profiling and classification)
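To make the classification step concrete, the sketch below combines the chi-square test and the degree-of-membership computation from this section; the dictionary-based signature and test-profile representation, and the example values, are assumptions made only for illustration:

```python
EPSILON = 3.84  # chi-square threshold: 1 degree of freedom, significance level 0.05

def degree_of_membership(signature, test_profile):
    """Return lambda = |U| / n, the fraction of critical API calls whose
    chi-square statistic stays within the threshold.

    signature   : {api_name: expected LCS length E_i} taken from SIG_i
    test_profile: {api_name: observed LCS length O_i} taken from the test file
    """
    n = len(signature)
    matched = 0
    for api, expected in signature.items():
        observed = test_profile.get(api, 0)
        chi_sq = (observed - expected) ** 2 / expected
        if chi_sq <= EPSILON:
            matched += 1
    return matched / n

sig = {"CreateFileA": 5, "WriteFile": 4, "RegSetValueExA": 3}
test = {"CreateFileA": 5, "WriteFile": 10, "RegSetValueExA": 2}
print(round(degree_of_membership(sig, test), 2))  # 0.67: two of three calls match
```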

4 Experimental Analysis

To test the resilience of our approach against program obfuscations, we tested it against the following known obfuscation techniques:

  • Code Transposition: Code is shuffled and unconditional jumps are inserted to restore the original order of execution of the program.

  • Instruction Substitution: Many instruction sequences can be rewritten with similar semantics. A number of such transformations were identified and applied.

  • Register Reassignment: The usage of registers is interchanged; for example, EAX is used as EBX and vice versa.

  • Dead Code Insertion: Code is added to the program without modifying its behavior. To challenge our technique, we added code consisting of critical API calls that results in no change of behavior.

For each program sample, a random number of program transformations was applied using the obfuscator, and a set of 1000 obfuscated binaries was generated per sample program. The signature was generated using only the sample program binary from which the 1000 obfuscated binaries were derived. All experiments were conducted on an Intel Celeron 1.6 GHz machine with 1 GB RAM.
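As an illustrative sketch only (not the actual obfuscator used in these experiments; DEAD_SNIPPETS and the instruction strings are invented for the example), a toy dead-code-insertion pass over a list of assembly-like instructions might look as follows:

```python
import random

# Behavior-preserving filler fragments (assembly-like strings, invented for the example).
DEAD_SNIPPETS = [
    ["push eax", "pop eax"],   # save and immediately restore a register
    ["xchg eax, eax"],         # the canonical x86 NOP encoding
    ["lea ebx, [ebx]"],        # loads a register with its own value
]

def insert_dead_code(instructions, probability=0.2, seed=None):
    """Insert dead-code filler after instructions with the given probability."""
    rng = random.Random(seed)
    out = []
    for ins in instructions:
        out.append(ins)
        if rng.random() < probability:
            out.extend(rng.choice(DEAD_SNIPPETS))
    return out

prog = ["mov eax, [ebp+8]", "call CreateFileA", "mov [ebp-4], eax"]
print(insert_dead_code(prog, probability=0.5, seed=1))
```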

4.1 Synthetic Program Similarity Test

The experiment was carried out on 245 samples of malicious programs using the synthetic obfuscations described earlier. The observations made during the experiment were:

  • Our approach was resilient to control-flow transpositions. They had no effect on the critical API call sequences generated using backward tracing.

  • Our approach was resilient to instruction substitution. One of the steps in our approach converts the disassembly into a C-like intermediate representation, in which such substitutions are recognized and replaced by a canonical code sequence that is kept the same throughout the program.

  • Our approach was resilient to register reassignment and dead code insertion. Neither of these techniques changes the data dependencies, so our backward tracing results were unchanged.

4.2 Real-Life Program Similarity Test

In this experiment, semantically obfuscated binaries of a sample program were tested against the signature of the original program from which they were generated. The experiment was carried out on two programs, notepad.exe and mstsc.exe. The API call sequences of each benign program were generated using backward tracing, and some dummy noise API calls were then incrementally introduced into these sequences. The following observations were made.

Observation 1. Beyond some point, adding noise to the API call sequences caused the program to no longer be detected as a variant of the original program. For the tested programs, notepad.exe and mstsc.exe, after about a 20 % increase in API call sequence length, our approach could not identify them as similar to the original sequences. One reason for this is the threshold chosen for similarity detection.

Observation 2. Given two programs, our tool was able to provide an estimated percentage of similarity between them based on the similarity of their API call sequences. This metric can be used to cluster similar programs, especially when analyzing a large number of programs.

5 Conclusion and Future Work

In this paper, we presented a novel static analysis approach to characterizing program behavior using behavior tracing. We observed that the calling-sequence relationships among API calls cannot be removed without changing the behavior of the program; thus, calling sequences of API calls capture program behavior to a large extent. To evaluate our approach accurately, we developed an obfuscator that applies random syntactic transformations to generate an obfuscated binary from an original binary.

Our approach generates a signature that is sufficient to characterize any number of future versions of a program. It is able to classify programs on the basis of behavioral similarity and also to provide an approximate measure of the quality of obfuscations. We provided results that show the efficacy of our approach in classifying similar programs. As future work, we intend to use our approach to cluster programs on the basis of behavior.