1 Introduction

The Internet has become the most popular medium of communication and a global information reservoir. With the increasing popularity of public social networking sites, users of every kind congregate around the Internet to get their share of the web. Although the general impression is one of growing cyber-security awareness among the masses, advanced and sophisticated hacker techniques easily counter defensive mechanisms and fool users.

Malicious web contents today primarily target web clients with browser vulnerabilities. In particular, drive-by downloads [1] are a specific type of web-based client-side attack in which a web browser requests web pages from a remote web server. In response, the server returns a webpage containing attack code that exploits a remote code execution vulnerability in the browser. If the malware is not delivered as part of the attack code's payload, a special payload called a downloader can first pull and then execute malware on the local workstation. The entire attack happens without the user's consent or notice. These attacks normally take advantage of the tight coupling of browser plug-ins with the browser environment. The memory of the browser is physically shared with its various extensions, making it highly susceptible to heap spray [2] and similar attacks. Deterministic heap behavior allows the attacker to reliably assume complete control of the browser memory and eventually the entire system.

The detection of malicious websites primarily relies on the following strategies:

  (a) Browser Built-in Protection

      Browser Protection Plug-ins [3], Safe-Browsing like Google [4]

  (b) Static and Machine Learning Approaches

      JavaScript Features [5, 6], HTML and URL Structural Processing [6], HTTP Communication Patterns [7], Pattern-Matching [8]

  (c) Memory Monitoring

      Memory Corruption and Heap Spray Detection [9], Data Memory Protection [10]

  (d) Emulation-Based Mitigation Technique

      Browser Emulation with HTTP Response Verification, Sandboxing the Script Execution and Result Verification [11]

  (e) Impact Learning

      Monitoring Downloaded Content Correlated with User Events [12], Un-consented Content Execution Prevention [13]

  (f) HoneyClients

      • Low Interaction Honeyclients: HoneyC [14], PhoneyC [15], Honeysift [16], Monkey-Spider [17], Honeyware [18]

      • High Interaction Honeyclients: Capture-HPC, HoneyClient, HoneyMonkey, Shelia, UW Spycrawler, WEF

UAC (URL Analyzer and Classifier) is a lightweight solution that combines static analysis with run-time emulation to identify malicious web pages. It inspects a web page from multiple dimensions, including DOM parsing to identify potentially suspicious DOM elements such as hidden iframes and malicious links, JavaScript analysis to detect obfuscation and malicious behavior, dynamic domain-redirection tracking, and scanning for suspicious patterns. UAC has the following features to offer:

Hybrid Analysis Framework. UAC offers hybrid analysis capability to counter the hiding techniques employed by attackers and to cover a reasonable analysis domain. Run-time emulation provides a safe inspection environment and exposes dynamic behavior, whereas static analysis offers fast investigation.

Light-weight Approach. It has been tested with respect to system and performance measurements and has proved to incur low overhead. It demands minimal system resources and takes around 20 s for each analysis.

Supervised Learning-based model. The JavaScript analysis and its behavioral profiling are based on supervised learning models to deliver accurate results.

Distributed Deployment. The solution has been deployed as a Low Interaction Honeyclient at various geographical locations to permit distributed load balancing and the capture of targeted (region-specific) attacks.

Scalable Solution. A hash-based technique to eliminate redundant URL analysis has been integrated. Also, the architecture ensures that the analysis is done at the client side and only the results are mapped to the central server, which reduces transmission load and consumes less network bandwidth.

Evaluated Version. It has been evaluated against various open-source Low Interaction Honeyclients and also against Google Safe Browsing. The results show that UAC is very effective in detecting malicious URLs, with a very low false positive rate of 0.2 % and a false negative rate of 0.08 %.

2 Related Work

Caffeine Monkey [19] is a client-side honeypot technology to identify browser exploitation. It employs a JavaScript de-obfuscator, logger, and profiler to identify malicious websites; its JavaScript behavioral analysis is based on function-call analysis. Whereas the common aspect of Caffeine Monkey and UAC is the use of function calls for JavaScript analysis, the significant difference lies in the selection of function calls. UAC makes use of 33 JavaScript function calls, selected after rigorous experiments on various websites that download malware.

Binspect [20] makes use of emulation and static analysis to detect drive-by-download and phishing attacks. It employs machine learning models based on URL features, page-source features (HTML and JavaScript), and social-reputation features. UAC, however, analyzes the web page using behavioral rather than structural features for more accurate interpretation.

ZOZZLE [21], a fast and precise in-browser JavaScript malware detector, is based on static JS analysis using function-call hooking in the browser JS engine. Bayesian classification of hierarchical features in the form of the JavaScript abstract syntax tree is used to identify syntax elements that are highly predictive of malware. However, it primarily addresses no-op and heap spray attacks. The obfuscation detection of JavaScript in UAC is primarily derived from "Automatic Detection for JavaScript Obfuscation Attacks in Web Pages through String Pattern Analysis" [22], which makes use of n-grams, entropy, and string length to identify obfuscation in scripts.

JStill [23] detects obfuscated JavaScript and uses function-invocation-based analysis to detect malicious JavaScript. It also highlights the shortcomings of browser-based mechanisms. However, its analysis is based on inspecting the arguments of function calls that are dynamically invoked. UAC, on the other hand, makes use of the statistical and sequential features inherent in function-call invocation, where obfuscation detection is done in a separate thread.

"Knowing your enemy: understanding and detecting malicious web advertising" [24] developed MadTracer for spam, drive-by-downloads, and click frauds. It analyzes hidden iframe injections and redirections. UAC also provides information on iframes and malicious links, but it identifies all iframes and analyzes them according to their visibility index and structure. In addition, it also identifies suspicious links on a web page.

3 Problem Definition and Approach Adopted

Being a type of client-side attack, drive-by-download attacks need to be detected at the client side. The problem statement can be framed as the development of a client honeypot for (a) overcoming the challenge of multiple browser-OS combinations to detect an actual system exploit, (b) capturing static and dynamic webpage contents, (c) inspecting dynamic JavaScript behavior to detect mal-code and/or redirections, and (d) large-scale deployment of the analysis mechanism, which demands a low-overhead and fast approach in addition to addressing scalability.

3.1 Approach Adopted

To address the above problem statement, UAC has been developed. It employs an emulated browser and JavaScript engine that facilitate the execution of URLs and JavaScripts in a safe emulated environment without the need to configure a browser-specific environment. The use of emulation enables capturing of static and run-time (dynamically generated) web contents, including potentially malicious iframes and links. The use of a JavaScript engine enables the inspection of dynamic JavaScript behavior, thus defeating the obfuscation and other code-hiding techniques used by attackers. The following challenges and their solutions provide an overview of the approach adopted:

3.2 Challenge 1: Overcoming the Challenge of Multiple Browser-OS Combinations to Detect Actual System Exploit

UAC is a browser-independent solution that utilizes an emulated browser and JavaScript engine to facilitate the execution of URLs and JavaScripts in a safe emulated environment (protected from self-exploitation) without the need to configure a browser-specific environment.

3.3 Challenge 2: Capturing Available and Generated (Static and Dynamic) Webpage Contents

Executing a URL using a browser configured with a DOM parser and a JavaScript engine permits monitoring of static and run-time web contents, including likely malicious iframes, links, and invoked scripts.

3.4 Challenge 3: Transient Malware Compromises Effectiveness of Static Analysis

Transient JavaScript malware can be effectively monitored at run-time, where it exhibits its actual behavior. A hybrid analysis technique (static and run-time) is employed in UAC to expose the dynamic behavior of the webpage.

3.5 Challenge 4: Inspection of Dynamic JavaScript Behavior to Detect Mal-code and/or Redirections

The use of a JavaScript engine in UAC enables the inspection of dynamic JavaScript behavior, thus defeating the obfuscation and other code-hiding techniques used by attackers.

3.6 Challenge 5: Establish Significant (Legitimate and Illegitimate) JS Function-calls

Thirty-three JavaScript function calls have been selected after rigorous experiments (using a commercial sandbox) on JavaScripts extracted from sites that drop malware. These function calls occur most frequently in suspicious websites.

3.7 Challenge 6: Scalability Aspects

A hash-based redundancy check has been applied in UAC to prevent redundant URL analysis.

4 UAC Design

Figure 1 illustrates the design of UAC, in which the input is a set of seed URLs that are further crawled and then analyzed. The input URLs are executed using an emulated browser and relevant parameters are captured. UAC classifies each analyzed site as "Likely Suspicious", "Suspicious", "Highly Suspicious", "Benign", or "Error". This classification is based on the final rule-set generated after URL analysis.

Fig. 1 UAC modular design

4.1 URL Active Crawling

The active URL hunt is done using a web crawler that extracts web links from a given web page. URL crawling follows a standard algorithm that downloads website contents and extracts links based on recognized patterns.

An important challenge in the implementation of the web crawler is the selection of an optimum crawling depth. If the depth is too low, crawling is limited to a few sites; a large crawling depth produces enormous overhead and becomes the bottleneck in the whole analysis process. Table 1 summarizes the experiments that were carried out to select the most suitable depth value. The processing overhead incurred by the web crawler on the system can be averaged as:

Table 1 Crawling depth selection
  • Time Consumption: 0.033 s/URL (Average)

  • Memory Consumption: 7.86 kb/URL (Average)

From the table it can be concluded that a depth value of 2 maintains a balance between detection rate and processing overhead. However, the user is provided with an option to select a crawling depth between 0 and 3 according to analysis needs.
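
For illustration, a depth-bounded crawler of this kind might be sketched as follows. The function and parameter names (crawl, max_depth) and the use of the requests package are assumptions for clarity, not UAC's actual implementation.

```python
# Minimal sketch of a depth-bounded link crawler (illustrative only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests


class LinkExtractor(HTMLParser):
    """Collects href attributes of <a> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_depth=2):
    """Breadth-first crawl starting at seed_url, bounded by max_depth."""
    seen = {seed_url}
    queue = deque([(seed_url, 0)])
    collected = []
    while queue:
        url, depth = queue.popleft()
        collected.append(url)
        if depth >= max_depth:
            continue
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # unreachable pages are skipped, not fatal
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return collected
```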

4.2 Hash-Based Redundancy Checker

UAC is implemented as a distributed system, i.e. deployed at various geographical locations to capture location-specific attacks and to enable load distribution during peak operations. To scale the system, the initial URL seeding is implemented in the form of hash structures to prevent redundant URL analysis. Major DOM elements like <a>, <base>, <body>, <button>, <command>, <datalist>, <div>, <embed>, <form>, <iframe>, <li>, <link>, <object>, <source>, <internal script>, <external script>, <asynchronous script>, etc. are parsed as shown in Fig. 2.

Fig. 2 Hash-based redundancy checker

These DOM elements have been cataloged based on the dynamicity and impact that they exhibit on a website. Their values are then converted into a hash structure in the form of a string key. The hash-map data structure directly maps a given key (extracted after parsing the DOM structure of the site) to a classification if the site has been previously analyzed, so no further analysis is needed. If no matching key is found, the hash table is updated with the newly generated key. The updated hash table is propagated to each distributed location on a regular basis.
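
A minimal sketch of such a key-to-classification lookup is shown below. The key construction from DOM element counts and the use of SHA-256 are illustrative assumptions, not UAC's exact hashing scheme.

```python
# Illustrative sketch of a hash-based redundancy check.
import hashlib


def dom_key(element_counts):
    """Builds a string key from parsed DOM element counts,
    e.g. {'iframe': 2, 'external script': 3, ...}, and hashes it."""
    canonical = ";".join(f"{tag}={count}"
                         for tag, count in sorted(element_counts.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_or_register(hash_table, element_counts):
    """Returns a cached classification if the key was seen before,
    otherwise registers the key as pending analysis."""
    key = dom_key(element_counts)
    if key in hash_table:
        return hash_table[key]          # previously analyzed: reuse verdict
    hash_table[key] = "pending"         # new key: full analysis required
    return None


# Usage: verdicts computed at a client node are written back, and the updated
# table is synchronized to the other distributed locations periodically.
table = {}
counts = {"iframe": 2, "a": 40, "external script": 3}
if check_or_register(table, counts) is None:
    table[dom_key(counts)] = "Suspicious"   # result of the full UAC analysis
```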

4.3 Hybrid Analysis Mechanism

In order to capture the actual behavior of a website, it is recommended that the site be executed in an emulated browser, if not a real one. This enables capturing the run-time behavioral aspects of the URL. For this purpose, the ELinks text browser [25], an open-source text-mode web browser, has been deployed. The browser is further configured with the SpiderMonkey [26] JavaScript engine, which is responsible for rendering and exposing the component object model to JavaScripts. However, the browser and JavaScript engine functionality is utilized only to extract relevant analysis parameters that are later evaluated, as shown in Fig. 3.

Fig. 3 Hybrid analysis process of UAC

4.4 DOM Parsing to Detect Suspicious DOM Elements

The DOM parser, as shown in Fig. 4, monitors all the website components that become part of the DOM during URL execution. The DOM of a website defines the complete structure of the site. DOM elements may exist statically or may be generated dynamically. The DOM parser scans for the following suspicious elements.

Fig. 4 DOM elements scanned

  1. Potentially Malicious iFrames

     Iframes add redirections to a site; they are either present as static DOM elements on compromised sites or are injected dynamically by malicious scripts. The following iframes are considered potentially suspicious and are extracted:

     • Hidden iframes (with a visibility index ranging from 0 to 2)

     • Likely malicious iframes of the form http://foreigndomain.com/location/resource_id=?, which are normally involved in delivering information to third parties or in exchanging some kind of identification.

  2. Potentially Malicious Links

     • Links containing executable file extensions such as .exe or .dll that lead to a binary drop on the system.

     • Links of the form http://foreigndomain.com/location/resource_id=?, which are potentially suspicious for the reasons stated above. All such links are first filtered against a whitelist (top-rated benign sites) and then populated to the database as potentially suspicious links. A minimal parsing sketch follows this list.
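
The sketch below illustrates this DOM scan using Python's html.parser. The visibility heuristic (width/height of at most 2, or display:none), the whitelist entry, and the URL patterns are simplified assumptions rather than UAC's exact rules.

```python
# Illustrative sketch of the scan for suspicious iframes and links.
import re
from html.parser import HTMLParser

WHITELIST = {"example-benign.com"}          # hypothetical top-rated benign domain
BINARY_EXT = re.compile(r"\.(exe|dll)(\?|$)", re.IGNORECASE)
PARAM_STYLE = re.compile(r"https?://[^/]+/.*resource_id=", re.IGNORECASE)


class SuspiciousDomScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hidden_iframes = []
        self.suspicious_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe":
            width = attrs.get("width", "") or ""
            height = attrs.get("height", "") or ""
            style = (attrs.get("style", "") or "").replace(" ", "").lower()
            tiny = (width.isdigit() and height.isdigit()
                    and int(width) <= 2 and int(height) <= 2)
            if tiny or "display:none" in style:
                self.hidden_iframes.append(attrs.get("src", ""))
        elif tag == "a":
            href = attrs.get("href", "") or ""
            domain = re.sub(r"^https?://", "", href).split("/")[0]
            if domain in WHITELIST:
                return                      # whitelisted domains are filtered out
            if BINARY_EXT.search(href) or PARAM_STYLE.search(href):
                self.suspicious_links.append(href)
```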

4.5 JavaScript Analysis

JavaScripts add dynamicity to a website because they are executed by the browser at the time of the URL visit. Browsers generally incorporate a JavaScript engine that renders the code for a site. Due to their dynamic nature, JavaScripts are responsible for more than 80 % of web attacks that involve client-side exploitation. Hence, they form a critical part of the web contents to be analyzed exhaustively. The following analysis is performed on the JavaScripts extracted from a site:

  1. Obfuscation Detection

     Obfuscation is the means of hiding the actual intent of a script through techniques that encode or otherwise transform the plain-text code. Its detection is significant since most malicious scripts are obfuscated to easily evade signature detection or even manual analysis. Figure 5 depicts an obfuscated script received during analysis.

     Fig. 5 Obfuscated JavaScript sample

     The obfuscation detection is based on the following parameters (a minimal feature-extraction sketch is given at the end of this subsection):

     (a) N-grams Mining

       • A 1-gram distribution is computed for each of the following character classes in the JavaScripts:

         • normal characters (u and x)

         • numeric characters (0–9)

         • special symbols (@, #, $, %, etc.)

       • Obfuscated scripts exhibit a high concentration of the above characters, and hence their frequency distribution is a useful indicator.

     (b) Entropy

       • The arguments of significant JavaScript function calls (found in malicious JavaScript) are captured and their entropy is calculated. Entropy is an indication of information content; the use of obfuscated strings greatly reduces the entropy, and hence its calculation is important. Entropy is calculated based on the Shannon entropy concept [27] with the following formula:

         $$E(B) = -\sum_{i=1}^{N} \frac{b_i}{T}\,\log\!\left(\frac{b_i}{T}\right), \qquad B = \{\,b_i,\ i = 1, \ldots, N\,\}, \qquad T = \sum_{i=1}^{N} b_i$$

     (c) Entropy Density

       • Entropy density is an important parameter since entropy alone may not always provide complete information. The distribution of the entropy over the whole range of input bytes is significant, and hence the entropy density is calculated as:

         $$\text{Entropy Density} = \text{Entropy} / \text{String length}$$

     (d) Longest Word Length

       • Obfuscated strings generally have larger word lengths because they use longer hexadecimal (or similar) encodings to represent a single character.

     All the above parameters are extracted and compared against a machine-learned model. The model has been generated after due training using both benign and malicious samples. Trees-Random Forest [28] is the learning algorithm employed in UAC, selected after intensive experiments on the dataset with various learning algorithms; it produced the fewest false positives and false negatives (as depicted in the confusion matrix) during training. Table 2 provides an overview of the criteria used for the selection of machine learning algorithms for the various analysis mechanisms.

        Table 2 Selection criteria for machine learning algorithm

  2. JavaScript Behavioral Profiling

     Obfuscation is only an indication of malicious intent; the actual behavior still remains to be identified. The behavioral profiling of the JavaScript is based on significant function (API) calls. Thirty-three significant function calls have been selected after extensive experiments on sites that drop malware (the malware drop confirmed using commercial sandbox analysis); these primarily include eval, unescape, concatstring, undependstring, execute, setproperty, and so on. The function calls selected from malicious websites are further refined by comparison with the function calls most employed by benign sites. The following analysis is performed on these calls:

     (a) Frequency Mining of Function Calls

       The frequency distribution of the (short-listed) function calls in the JavaScripts extracted from websites is computed. A numeric reference-id is assigned to each function call and the distribution is compared with a machine-learned model, generated after due training using both benign and malicious samples. Experiments have been performed with various learning algorithms on the derived dataset; Meta-Rotation Forest [29] is the learning algorithm that provides the most effective true positive and true negative rates.

     (b) Sequence Mining of Function Calls

       To determine the sequential behavior of the function calls, they are grouped into logical categories based on their functionality. Table 3 lists the 13 groups that have been identified. The grouping is important because tracing sequential function-call behavior requires tracing the functionality aspect irrespective of the specific call employed; for instance, string manipulation can be performed using numerous different calls. After the calls are divided under their logical heads, sliding-window sequences are generated with window-size = 5, a value selected after experiments with window sizes of 2, 5, 10, 15, 20, 25, and 30. Trees-Random Forest [28] is the learning algorithm used for classification.

      Table 3 Function calls profiling for malicious JavaScript analyzed by UAC
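
For illustration, the sketch below extracts the obfuscation features (1-gram counts, Shannon entropy, entropy density, longest word length) and the window-size-5 call sequences described above. The function names and feature layout are assumptions; the entropy is computed here over a raw script string rather than individual call arguments for brevity, and the trained classifiers themselves are not shown.

```python
# Illustrative feature-extraction sketches for the analyses described above.
import math
import re
from collections import Counter


def obfuscation_features(text):
    """Character-level features used for obfuscation detection."""
    counts = Counter(text)
    total = max(len(text), 1)
    # Shannon entropy: E(B) = -sum((b_i / T) * log(b_i / T))
    entropy = -sum((c / total) * math.log(c / total, 2) for c in counts.values())
    longest_word = max((len(w) for w in re.split(r"\s+", text) if w), default=0)
    return {
        "u_x_chars": counts["u"] + counts["x"],
        "digits": sum(counts[d] for d in "0123456789"),
        "specials": sum(counts[s] for s in "@#$%"),
        "entropy": entropy,
        "entropy_density": entropy / total,
        "longest_word": longest_word,
    }


def sliding_windows(call_groups, size=5):
    """Generates sliding-window sequences (window size 5) over function calls
    already mapped to their logical group ids (cf. Table 3)."""
    return [tuple(call_groups[i:i + size])
            for i in range(len(call_groups) - size + 1)]
```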

4.6 Signature Scanning

The HTML and extracted JavaScript contents are scanned against malicious signatures, which have been included from the following sources:

  (a) Self-Crafted Signatures

      Currently 5 such signatures exist, which have been formulated from all instances of JavaScripts extracted from Drive-by-Download websites.

  (b) iScanner Signatures

      iScanner [30] specifically contains the signatures to detect malicious strings in HTML DOM and JavaScripts.

  (c) Snort Signatures

      Snort content-based JavaScript signatures [31] have been included in UAC.

  (d) Honeysift Signatures

      Honeysift [16] is a low interaction Honeyclient which provides 19 malicious signatures for JavaScript.
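
For illustration, a content-based scan against such signatures could look like the sketch below; the two patterns shown are hypothetical placeholders rather than actual rules from the sources listed above.

```python
# Minimal sketch of content-based signature scanning (placeholder patterns only).
import re

SIGNATURES = {
    "hypothetical-eval-unescape": re.compile(r"eval\s*\(\s*unescape\s*\(", re.I),
    "hypothetical-write-unescape": re.compile(r"document\.write\s*\(\s*unescape\s*\(", re.I),
}


def scan_content(content):
    """Returns the names of all signatures that match the HTML/JS content."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(content)]
```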

4.7 Redirection Domains and DOM Structural Graph

UAC provides an additional output of all the redirections that were dynamically and automatically generated during the URL visit. The domain information is extracted using DNS transactions. These provide an overview of all the sites involved in the infection cycle for any given malicious site and supply a significant domain-redirection chain to incident-handling agencies.

DOM structural graphs can also be visualized in tree form for every URL, giving details of the DOM elements and their placement within the site. The graphs are generated in PNG format for every analyzed site.
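
A minimal sketch of generating such a PNG tree from a parsed DOM is shown below; it assumes the Python graphviz package and a nested-dict representation of the DOM, neither of which is specified by UAC.

```python
# Illustrative sketch of emitting a DOM structural graph as PNG.
from graphviz import Digraph


def dom_tree_png(root, out_path="dom_graph"):
    """root is a nested dict such as
    {"tag": "html", "children": [{"tag": "body", "children": [...]}]}."""
    graph = Digraph(format="png")
    counter = [0]

    def add(node, parent_id=None):
        node_id = f"n{counter[0]}"
        counter[0] += 1
        graph.node(node_id, node["tag"])
        if parent_id is not None:
            graph.edge(parent_id, node_id)
        for child in node.get("children", []):
            add(child, node_id)

    add(root)
    graph.render(out_path, cleanup=True)   # writes out_path + ".png"
```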

4.8 Parallel Evaluation

UAC performs a parallel evaluation with Google Safe Browsing for every URL, and the results are presented to the user on the same analysis console. The last date of Google validation for a site is also included. Google declares a website as suspicious or benign and also provides additional information, such as domains acting as intermediaries for malware distribution or websites actively involved in transmitting infections. This facilitates benchmarking and comparison with UAC results.

4.9 Distributed Deployment

UAC is implemented as a Low Interaction Honeyclient and has been integrated into the Distributed Honeynet System (DHS). Currently, DHS nodes are operational at eight geographical locations across India. The distributed deployment is done through the implementation of UAC as a virtual machine in the DHS client node. The central analysis server performs load balancing and load distribution to the various nodes depending upon the URL list.

The actual analysis is performed at the client, and the results are mapped to a central analysis server on a regular basis. This significantly reduces the transmission overhead and consumes less bandwidth and memory. This design also minimizes the operating cost of the server.

5 Experimentations and Evaluations

5.1 Performance Measurement (Standalone Systems)

See Tables 4 and 5.

Table 4 Performance measurement of UAC
Table 5 UAC system measurements

5.2 Performance Measurements (Distributed Systems)

See Table 6.

Table 6 UAC aspects for distributed deployments

5.3 Evaluations with Respect to Other Low Interaction Honeyclient

UAC has been evaluated against other open-source Low Interaction Honeyclients with respect to feature set and analysis capabilities. Table 7 presents the comparison results and depicts the effectiveness of UAC in detecting a large number of malicious URLs.

Table 7 Comparison of UAC with other Low Interaction Honeyclients

5.4 Experimental Evaluations

Lists of potentially malicious sites were derived from various sources, including CERT-In. These sites were analyzed by UAC and the results have been shared with the incident response group, which also aids in the validation of UAC results. The following statistics have been generated from these experiments (Table 8).

Table 8 Experimental Evaluation of UAC

5.5 Multi-threading Approach

A multi-threaded implementation permits still faster execution of UAC by exploiting the parallelism inherent in the analysis workflow. Table 9 provides an overview of the various stages in UAC that are candidates for multi-threaded execution.

Table 9 Multi-threading process in UAC

The performance improvement using multiple threads is directly visible from the following performance measurements:

 

                      Latency (s)   Throughput (URLs/h)
  With threading          12               300
  Without threading       20               180
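
As a minimal illustration of how such URL-level parallelism might be expressed, the sketch below fans out the per-URL analysis over a thread pool; analyze_url stands in for the full UAC pipeline (crawling, DOM parsing, JavaScript analysis, signature scanning) and the worker count is an assumption, not part of UAC's actual code.

```python
# Illustrative sketch of multi-threaded URL analysis using a thread pool.
from concurrent.futures import ThreadPoolExecutor


def analyze_all(urls, analyze_url, workers=8):
    """Runs analyze_url over all URLs in parallel and returns url -> verdict."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(analyze_url, urls)))
```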

6 Towards Signature Formulation

Anti-virus scanners detect attacks based on their signature databases. With the ever-growing diversification of attack code, it becomes useful and desirable to generate signatures for unknown attacks. The main goal of our approach is to update the signature database of the open-source community anti-virus, ClamAV.

All the JavaScripts that are declared malicious by UAC are further validated by submission to the VirusTotal portal to determine whether popular anti-virus scanners also label them as malicious. The automated signature-generation mechanism filters out all the scripts which are labeled as malicious by popular antivirus engines but not by ClamAV. Subsequently, hexadecimal and hash-based signatures are generated for the filtered JavaScripts and eventually populated into ClamAV to enhance its signature repository. This activity is a continual process that permits the regular enrichment of the open-source signature repository.
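
As an illustration of the hash-based part of this process, the sketch below emits ClamAV .hdb entries (format "MD5:filesize:MalwareName") for the filtered scripts; the file paths and the UAC.JS.<n> naming convention are hypothetical, and the hexadecimal (body-based) signatures are not shown.

```python
# Illustrative sketch of producing ClamAV hash-based (.hdb) signatures.
import hashlib


def hdb_entry(path, name):
    """Builds one ClamAV .hdb line ("MD5:filesize:MalwareName") for a file."""
    with open(path, "rb") as fh:
        data = fh.read()
    return f"{hashlib.md5(data).hexdigest()}:{len(data)}:{name}"


def write_hdb(script_paths, out_file="uac_js.hdb"):
    """Writes one entry per filtered malicious script."""
    with open(out_file, "w") as out:
        for i, path in enumerate(script_paths, start=1):
            out.write(hdb_entry(path, f"UAC.JS.{i}") + "\n")
```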

7 Conclusion and Future Work

UAC is a novel approach towards distributed and scalable analysis of URLs that leverages both dynamic execution (through emulation) and static analysis. UAC inspects the webpage from various perspectives, including suspicious-DOM parsing and JavaScript analysis, and attempts to cover the maximum analysis domain. Other popular dynamic client-side scripts such as JScript are easily accommodated in our analysis because they are based on the ECMA standards [32] and SpiderMonkey interprets ECMA scripts. We have also manually analyzed URLs declared benign by UAC to identify the reasons for failures and found that, for most such sites, the infection had already been removed by the time the site was analyzed by UAC. However, other analysis processes, such as file analyzers for SWF, PDF, and similar content, can be integrated for further inspection of the complete downloaded web-code. Also, on some websites we came across malware injected in the form of VB scripts, which is currently outside our scope.

Distributed crawling is an area we can pursue further, making use of facilities such as grid computing to perform large-scale analysis. The whole application can also be ported to a high-performance computing infrastructure to optimize the speed and level of performance for distributed computing.