1 Introduction

The Internet has become the most popular medium of communication and a global information reservoir. With the increasing popularity of public social networking sites, users of every kind congregate around the Internet to get their share of the web. Although the general impression is one of growing cyber-security awareness among the masses, advanced and sophisticated hacker techniques easily counter defensive mechanisms and fool users.

Malicious web contents today primarily target web clients with browser vulnerabilities. In particular, drive-by downloads [1] are a specific type of web-based client-side attack in which a web browser requests web pages from a remote web server. In response, the server returns a webpage containing attack code that exploits a remote code execution vulnerability in the browser. If the malware is not delivered as part of the attack code's payload, a special payload called a downloader can first pull and then execute malware on the local workstation. The entire attack happens without the user's consent or notice. These attacks normally take advantage of the tight coupling of browser plug-ins with the browser environment. The memory of the browser is physically shared with its various extensions, making it highly susceptible to heap spray [2] and similar attacks. Deterministic heap behavior allows the attacker to reliably assume complete control of the browser memory and eventually the entire system.

The detection of malicious websites primarily relies on the following strategies:

  (a) Browser Built-in Protection

      Browser Protection Plug-ins [3], Safe-Browsing like Google [4]

  (b) Static and Machine Learning Approaches

      JavaScript Features [5, 6], HTML and URL Structural Processing [6], HTTP Communication Patterns [7], Pattern-Matching [8]

  (c) Memory Monitoring

      Memory Corruption and Heap Spray Detection [9], Data Memory Protection [10]

  (d) Emulation-Based Mitigation Technique

      Browser Emulation with HTTP Response Verification, Sandboxing the Script Execution and Result Verification [11]

  (e) Impact Learning

      Monitoring Downloaded Content Correlated with User Events [12], Un-consented Content Execution Prevention [13]

  (f) HoneyClients

      • Low Interaction Honeyclients: HoneyC [14], PhoneyC [15], Honeysift [16], Monkey-Spider [17], Honeyware [18]

      • High Interaction Honeyclients: Capture-HPC, HoneyClient, HoneyMonkey, Shelia, UW Spycrawler, WEF

UAC (URL Analyzer and Classifier) is a lightweight solution that combines static analysis with run-time emulation to identify malicious web pages. It inspects a web page from multiple dimensions, including DOM parsing to identify potentially suspicious DOM elements such as hidden iframes and malicious links, JavaScript analysis to detect obfuscation and malicious behavior, dynamic domain-redirection tracking, and scanning for suspicious patterns. UAC has the following features to offer:

Hybrid Analysis Framework. UAC offers hybrid analysis capability to counter the hiding techniques employed by attackers and to cover a reasonable analysis domain. Run-time emulation provides a safe inspection environment and exposes dynamic behavior, whereas static analysis offers fast investigation.

Light-weight Approach. It has been tested with respect to system and performance measurements and has proved to incur low overhead. It demands minimal system resources and takes around 20 s for each analysis.

Supervised Learning-based model. The JavaScript analysis and its behavioral profiling are based on supervised learning models to deliver accurate results.

Distributed Deployment. The solution has been deployed as a Low Interaction Honeyclient at various geographical locations to permit distributed load balancing and the capture of targeted (region-specific) attacks.

Scalable Solution. A hash-based technique to eliminate redundant URL analysis has been integrated. Also, the architecture ensures that the analysis is done at the client side and only the results are mapped to the central server, which reduces transmission load and consumes less network bandwidth.

Evaluated Version. It has been evaluated against various open-source Low Interaction Honeyclients and also against Google Safe Browsing. The results show that UAC is very effective in detecting malicious URLs, with a very low false positive rate of 0.2 % and a false negative rate of 0.08 %.

2 Related Work

Caffeine Monkey [19] is a client-side honeypot technology to identify browser exploitation. It employs a JavaScript de-obfuscator, logger, and profiler to identify malicious websites; its JavaScript behavioral analysis is based on function-call analysis. Whereas the common aspect of Caffeine Monkey and UAC is the use of function calls for JavaScript analysis, the significant difference lies in the selection of function calls. UAC makes use of 33 JavaScript function calls, selected after rigorous experiments on various websites that download malware.

Binspect [20] makes use of emulation and static analysis to detect drive-by-download and phishing attacks. It employs machine learning models based on URL features, page-source features (HTML and JavaScript), and social-reputation features. UAC, however, analyzes the web page using behavioral rather than structural features for more accurate interpretation.

ZOZZLE [21], a fast and precise in-browser JavaScript malware detector, is based on static JS analysis using function-call hooking in the browser JS engine. Bayesian classification of hierarchical features in the form of the JavaScript abstract syntax tree is used to identify syntax elements that are highly predictive of malware. However, it primarily addresses no-op and heap spray attacks. The obfuscation detection of JavaScript in UAC is primarily derived from "Automatic Detection for JavaScript Obfuscation Attacks in Web Pages through String Pattern Analysis" [22], which makes use of n-grams, entropy, and string length to identify obfuscation in scripts.

JStill [23] detects obfuscated JavaScript and uses function-invocation-based analysis to detect malicious JavaScript. It also highlights the shortcomings of browser-based mechanisms. However, its analysis is based on inspecting the arguments of function calls that are dynamically invoked. UAC, on the other hand, makes use of the statistical and sequential features inherent in function-call invocation, where obfuscation detection is done in a separate thread.

"Knowing your enemy: understanding and detecting malicious web advertising" [24] developed MadTracer for spam, drive-by-downloads, and click frauds. It analyzes hidden iframe injections and redirections. UAC also provides information on iframes and malicious links, but it identifies all iframes and analyzes them according to their visibility index and structure. In addition, it also identifies suspicious links on a web page.

3 Problem Definition and Approach Adopted

Being a type of client-side attack, drive-by-download attacks need to be detected at the client side. The problem statement can be framed as the development of a client honeypot for (a) overcoming the challenge of multiple browser-OS combinations to detect an actual system exploit, (b) capturing static and dynamic webpage contents, (c) inspecting dynamic JavaScript behavior to detect mal-code and/or redirections, and (d) large-scale deployment of the analysis mechanism, which demands a low-overhead and fast approach in addition to addressing scalability.

3.1 Approach Adopted

To address the above problem statement, UAC has been developed. It employs an emulated browser and JavaScript engine that facilitate the execution of URLs and JavaScripts in a safe emulated environment without the need to configure a browser-specific environment. The use of emulation enables capturing of static and run-time (dynamically generated) web contents, including potentially malicious iframes and links. The use of a JavaScript engine enables the inspection of dynamic JavaScript behavior, thus defeating the obfuscation and other code-hiding techniques used by attackers. The following challenges and their solutions provide an overview of the approach adopted:

3.2 Challenge 1: Overcoming the Challenge of Multiple Browser-OS Combinations to Detect Actual System Exploit

UAC is a browser-independent solution that utilizes an emulated browser and JavaScript engine to facilitate the execution of URLs and JavaScripts in a safe emulated environment (protected from self-exploitation) without the need to configure a browser-specific environment.

3.3 Challenge 2: Capturing Available and Generated (Static and Dynamic) Webpage Contents

Executing a URL using a browser configured with a DOM parser and a JavaScript engine permits monitoring of static and run-time web contents, including likely malicious iframes, links, and invoked scripts.

3.4 Challenge 3: Transient Malware Compromises Effectiveness of Static Analysis

Transient JavaScript malware can be effectively monitored at run-time, where it exhibits its actual behavior. A hybrid analysis technique (static and run-time) is employed in UAC to expose the dynamic behavior of the webpage.

3.5 Challenge 4: Inspection of Dynamic JavaScript Behavior to Detect Mal-code and/or Redirections

The use of a JavaScript engine in UAC enables the inspection of dynamic JavaScript behavior, thus defeating the obfuscation and other code-hiding techniques used by attackers.

3.6 Challenge 5: Establish Significant (Legitimate and Illegitimate) JS Function-calls

Thirty-three JavaScript function calls have been selected after rigorous experiments (using a commercial sandbox) on JavaScripts extracted from sites that drop malware. These function calls occur most frequently in suspicious websites.

3.7 Challenge 6: Scalability Aspects

A hash-based redundancy check has been applied in UAC to prevent redundant URL analysis.

4 UAC Design

Figure 1 illustrates the design of UAC, in which the input is a set of seed URLs that are further crawled and then analyzed. The input URLs are executed using an emulated browser and relevant parameters are captured. UAC classifies each analyzed site as "Likely Suspicious", "Suspicious", "Highly Suspicious", "Benign", or "Error". This classification is based on the final rule-set generated after URL analysis.

Fig. 1 UAC modular design

4.1 URL Active Crawling

The active URL hunt is done using a web crawler that extracts web links from a given web page. URL crawling follows a standard algorithm that downloads website contents and extracts links based on recognized patterns.

An important challenge in the implementation of the web crawler is the selection of an optimum crawling depth. If the depth is too low, crawling is limited to a few sites; a large crawling depth produces enormous overhead and becomes the bottleneck in the whole analysis process. Table 1 summarizes the experiments that were carried out to select the most suitable depth value. The processing overhead incurred by the web crawler on the system can be averaged as:

Table 1 Crawling depth selection
  • Time Consumption: 0.033 s/URL (Average)

  • Memory Consumption: 7.86 kb/URL (Average)

From the table it can be concluded that a depth value of 2 maintains a balance between detection rate and processing overhead. However, the user is provided with an option to select a crawling depth between 0 and 3 according to analysis needs.
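
For illustration, a depth-bounded crawler of this kind might be sketched as follows. The function and parameter names (crawl, max_depth) and the use of the requests package are assumptions for clarity, not UAC's actual implementation.

```python
# Minimal sketch of a depth-bounded link crawler (illustrative only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests


class LinkExtractor(HTMLParser):
    """Collects href attributes of <a> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_depth=2):
    """Breadth-first crawl starting at seed_url, bounded by max_depth."""
    seen = {seed_url}
    queue = deque([(seed_url, 0)])
    collected = []
    while queue:
        url, depth = queue.popleft()
        collected.append(url)
        if depth >= max_depth:
            continue
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # unreachable pages are skipped, not fatal
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return collected
```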

4.2 Hash-Based Redundancy Checker

UAC is implemented as a distributed system, i.e. deployed at various geographical locations to capture location-specific attacks and to enable load distribution during peak operations. To scale the system, the initial URL seeding is implemented in the form of hash structures to prevent redundant URL analysis. Major DOM elements like <a>, <base>, <body>, <button>, <command>, <datalist>, <div>, <embed>, <form>, <iframe>, <li>, <link>, <object>, <source>, <internal script>, <external script>, <asynchronous script>, etc. are parsed as shown in Fig. 2.

Fig. 2 Hash-based redundancy checker

These DOM elements have been cataloged based on the dynamicity and impact that they exhibit on a website. Their values are then converted into a hash structure in the form of a string key. The hash-map data structure directly maps a given key (extracted after parsing the DOM structure of the site) to a classification if the site has been previously analyzed, so no further analysis is needed. If no matching key is found, the hash table is updated with the newly generated key. The updated hash table is propagated to each distributed location on a regular basis.
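
A minimal sketch of such a key-to-classification lookup is shown below. The key construction from DOM element counts and the use of SHA-256 are illustrative assumptions, not UAC's exact hashing scheme.

```python
# Illustrative sketch of a hash-based redundancy check.
import hashlib


def dom_key(element_counts):
    """Builds a string key from parsed DOM element counts,
    e.g. {'iframe': 2, 'external script': 3, ...}, and hashes it."""
    canonical = ";".join(f"{tag}={count}"
                         for tag, count in sorted(element_counts.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_or_register(hash_table, element_counts):
    """Returns a cached classification if the key was seen before,
    otherwise registers the key as pending analysis."""
    key = dom_key(element_counts)
    if key in hash_table:
        return hash_table[key]          # previously analyzed: reuse verdict
    hash_table[key] = "pending"         # new key: full analysis required
    return None


# Usage: verdicts computed at a client node are written back, and the updated
# table is synchronized to the other distributed locations periodically.
table = {}
counts = {"iframe": 2, "a": 40, "external script": 3}
if check_or_register(table, counts) is None:
    table[dom_key(counts)] = "Suspicious"   # result of the full UAC analysis
```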

4.3 Hybrid Analysis Mechanism

In order to capture the actual behavior of a website, it is recommended that the site be executed in an emulated browser, if not a real one. This enables capturing the run-time behavioral aspects of the URL. For this purpose, the ELinks text browser [25], an open-source text-mode web browser, has been deployed. The browser is further configured with the SpiderMonkey [26] JavaScript engine, which is responsible for rendering and exposing the component object model to JavaScripts. However, the browser and JavaScript engine functionality is utilized only to extract relevant analysis parameters that are later evaluated, as shown in Fig. 3.

Fig. 3 Hybrid analysis process of UAC

4.4 DOM Parsing to Detect Suspicious DOM Elements

The DOM parser, as shown in Fig. 4, monitors all the website components that become part of the DOM during URL execution. The DOM of a website defines the complete structure of the site. DOM elements may exist statically or may be generated dynamically. The DOM parser scans for the following suspicious elements.

Fig. 4 DOM elements scanned

  1. Potentially Malicious iFrames

     Iframes add redirections to a site; they are either present as static DOM elements on compromised sites or are injected dynamically by malicious scripts. The following iframes are considered potentially suspicious and are extracted:

     • Hidden iframes (with a visibility index ranging from 0 to 2)

     • Likely malicious iframes of the form http://foreigndomain.com/location/resource_id=?, which are normally involved in delivering information to third parties or in exchanging some kind of identification.

  2. Potentially Malicious Links

     • Links containing executable file extensions such as .exe or .dll that lead to a binary drop on the system.

     • Links of the form http://foreigndomain.com/location/resource_id=?, which are potentially suspicious for the reasons stated above. All such links are first filtered against a whitelist (top-rated benign sites) and then populated to the database as potentially suspicious links. A minimal parsing sketch follows this list.
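
The sketch below illustrates this DOM scan using Python's html.parser. The visibility heuristic (width/height of at most 2, or display:none), the whitelist entry, and the URL patterns are simplified assumptions rather than UAC's exact rules.

```python
# Illustrative sketch of the scan for suspicious iframes and links.
import re
from html.parser import HTMLParser

WHITELIST = {"example-benign.com"}          # hypothetical top-rated benign domain
BINARY_EXT = re.compile(r"\.(exe|dll)(\?|$)", re.IGNORECASE)
PARAM_STYLE = re.compile(r"https?://[^/]+/.*resource_id=", re.IGNORECASE)


class SuspiciousDomScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hidden_iframes = []
        self.suspicious_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe":
            width = attrs.get("width", "") or ""
            height = attrs.get("height", "") or ""
            style = (attrs.get("style", "") or "").replace(" ", "").lower()
            tiny = (width.isdigit() and height.isdigit()
                    and int(width) <= 2 and int(height) <= 2)
            if tiny or "display:none" in style:
                self.hidden_iframes.append(attrs.get("src", ""))
        elif tag == "a":
            href = attrs.get("href", "") or ""
            domain = re.sub(r"^https?://", "", href).split("/")[0]
            if domain in WHITELIST:
                return                      # whitelisted domains are filtered out
            if BINARY_EXT.search(href) or PARAM_STYLE.search(href):
                self.suspicious_links.append(href)
```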

4.5 JavaScript Analysis

JavaScripts add dynamicity to a website because they are executed by the browser at the time of the URL visit. Browsers generally incorporate a JavaScript engine that renders the code for a site. Due to their dynamic nature, JavaScripts are responsible for more than 80 % of web attacks that involve client-side exploitation. Hence, they form a critical part of the web contents to be analyzed exhaustively. The following analysis is performed on the JavaScripts extracted from a site:

  1. Obfuscation Detection

     Obfuscation is the means of hiding the actual intent of a script through techniques that encode or otherwise transform the plain-text code. Its detection is significant since most malicious scripts are obfuscated to easily evade signature detection or even manual analysis. Figure 5 depicts an obfuscated script received during analysis.

     Fig. 5 Obfuscated JavaScript sample

     The obfuscation detection is based on the following parameters (a minimal feature-extraction sketch is given at the end of this subsection):

     (a) N-grams Mining

       • A 1-gram distribution is computed for each of the following character classes in the JavaScripts:

         • normal characters (u and x)

         • numeric characters (0–9)

         • special symbols (@, #, $, %, etc.)

       • Obfuscated scripts exhibit a high concentration of the above characters, and hence their frequency distribution is a useful indicator.

     (b) Entropy

       • The arguments of significant JavaScript function calls (found in malicious JavaScript) are captured and their entropy is calculated. Entropy is an indication of information content; the use of obfuscated strings greatly reduces the entropy, and hence its calculation is important. Entropy is calculated based on the Shannon entropy concept [27] with the following formula:

         $$E(B) = -\sum_{i=1}^{N} \frac{b_i}{T}\,\log\!\left(\frac{b_i}{T}\right), \qquad B = \{\,b_i,\ i = 1, \ldots, N\,\}, \qquad T = \sum_{i=1}^{N} b_i$$

     (c) Entropy Density

       • Entropy density is an important parameter since entropy alone may not always provide complete information. The distribution of the entropy over the whole range of input bytes is significant, and hence the entropy density is calculated as:

         $$\text{Entropy Density} = \text{Entropy} / \text{String length}$$

     (d) Longest Word Length

       • Obfuscated strings generally have larger word lengths because they use longer hexadecimal (or similar) encodings to represent a single character.

     All the above parameters are extracted and compared against a machine-learned model. The model has been generated after due training using both benign and malicious samples. Trees-Random Forest [28] is the learning algorithm employed in UAC, selected after intensive experiments on the dataset with various learning algorithms; it produced the fewest false positives and false negatives (as depicted in the confusion matrix) during training. Table 2 provides an overview of the criteria used for the selection of machine learning algorithms for the various analysis mechanisms.

        Table 2 Selection criteria for machine learning algorithm

  2. JavaScript Behavioral Profiling

     Obfuscation is only an indication of malicious intent; the actual behavior still remains to be identified. The behavioral profiling of the JavaScript is based on significant function (API) calls. Thirty-three significant function calls have been selected after extensive experiments on sites that drop malware (the malware drop confirmed using commercial sandbox analysis); these primarily include eval, unescape, concatstring, undependstring, execute, setproperty, and so on. The function calls selected from malicious websites are further refined by comparison with the function calls most employed by benign sites. The following analysis is performed on these calls:

     (a) Frequency Mining of Function Calls

       The frequency distribution of the (short-listed) function calls in the JavaScripts extracted from websites is computed. A numeric reference-id is assigned to each function call and the distribution is compared with a machine-learned model, generated after due training using both benign and malicious samples. Experiments have been performed with various learning algorithms on the derived dataset; Meta-Rotation Forest [29] is the learning algorithm that provides the most effective true positive and true negative rates.

     (b) Sequence Mining of Function Calls

       To determine the sequential behavior of the function calls, they are grouped into logical categories based on their functionality. Table 3 lists the 13 groups that have been identified. The grouping is important because tracing sequential function-call behavior requires tracing the functionality aspect irrespective of the specific call employed; for instance, string manipulation can be performed using numerous different calls. After the calls are divided under their logical heads, sliding-window sequences are generated with window-size = 5, a value selected after experiments with window sizes of 2, 5, 10, 15, 20, 25, and 30. Trees-Random Forest [28] is the learning algorithm used for classification.

      Table 3 Function calls profiling for malicious JavaScript analyzed by UAC
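
For illustration, the sketch below extracts the obfuscation features (1-gram counts, Shannon entropy, entropy density, longest word length) and the window-size-5 call sequences described above. The function names and feature layout are assumptions; the entropy is computed here over a raw script string rather than individual call arguments for brevity, and the trained classifiers themselves are not shown.

```python
# Illustrative feature-extraction sketches for the analyses described above.
import math
import re
from collections import Counter


def obfuscation_features(text):
    """Character-level features used for obfuscation detection."""
    counts = Counter(text)
    total = max(len(text), 1)
    # Shannon entropy: E(B) = -sum((b_i / T) * log(b_i / T))
    entropy = -sum((c / total) * math.log(c / total, 2) for c in counts.values())
    longest_word = max((len(w) for w in re.split(r"\s+", text) if w), default=0)
    return {
        "u_x_chars": counts["u"] + counts["x"],
        "digits": sum(counts[d] for d in "0123456789"),
        "specials": sum(counts[s] for s in "@#$%"),
        "entropy": entropy,
        "entropy_density": entropy / total,
        "longest_word": longest_word,
    }


def sliding_windows(call_groups, size=5):
    """Generates sliding-window sequences (window size 5) over function calls
    already mapped to their logical group ids (cf. Table 3)."""
    return [tuple(call_groups[i:i + size])
            for i in range(len(call_groups) - size + 1)]
```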

4.6 Signature Scanning

The HTML and extracted JavaScript contents are scanned against malicious signatures, which have been included from the following sources:

  (a) Self-Crafted Signatures

      Currently 5 such signatures exist, which have been formulated from all instances of JavaScripts extracted from Drive-by-Download websites.

  (b) iScanner Signatures

      iScanner [30] specifically contains the signatures to detect malicious strings in HTML DOM and JavaScripts.

  (c) Snort Signatures

      Snort content-based JavaScript signatures [31] have been included in UAC.

  (d) Honeysift Signatures

      Honeysift [16] is a low interaction Honeyclient which provides 19 malicious signatures for JavaScript.
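
For illustration, a content-based scan against such signatures could look like the sketch below; the two patterns shown are hypothetical placeholders rather than actual rules from the sources listed above.

```python
# Minimal sketch of content-based signature scanning (placeholder patterns only).
import re

SIGNATURES = {
    "hypothetical-eval-unescape": re.compile(r"eval\s*\(\s*unescape\s*\(", re.I),
    "hypothetical-write-unescape": re.compile(r"document\.write\s*\(\s*unescape\s*\(", re.I),
}


def scan_content(content):
    """Returns the names of all signatures that match the HTML/JS content."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(content)]
```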

4.7 Redirection Domains and DOM Structural Graph

UAC provides an additional output of all the redirections that were dynamically and automatically generated during the URL visit. The domain information is extracted using DNS transactions. These provide an overview of all the sites involved in the infection cycle for any given malicious site and supply a significant domain-redirection chain to incident-handling agencies.

DOM structural graphs can also be visualized in tree form for every URL, giving details of the DOM elements and their placement within the site. The graphs are generated in PNG format for every analyzed site.
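
A minimal sketch of generating such a PNG tree from a parsed DOM is shown below; it assumes the Python graphviz package and a nested-dict representation of the DOM, neither of which is specified by UAC.

```python
# Illustrative sketch of emitting a DOM structural graph as PNG.
from graphviz import Digraph


def dom_tree_png(root, out_path="dom_graph"):
    """root is a nested dict such as
    {"tag": "html", "children": [{"tag": "body", "children": [...]}]}."""
    graph = Digraph(format="png")
    counter = [0]

    def add(node, parent_id=None):
        node_id = f"n{counter[0]}"
        counter[0] += 1
        graph.node(node_id, node["tag"])
        if parent_id is not None:
            graph.edge(parent_id, node_id)
        for child in node.get("children", []):
            add(child, node_id)

    add(root)
    graph.render(out_path, cleanup=True)   # writes out_path + ".png"
```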

4.8 Parallel Evaluation

UAC performs a parallel evaluation with Google Safe Browsing for every URL, and the results are presented to the user on the same analysis console. The last date of Google validation for a site is also included. Google declares a website as suspicious or benign and also provides additional information, such as domains acting as intermediaries for malware distribution or websites actively involved in transmitting infections. This facilitates benchmarking and comparison with UAC results.

4.9 Distributed Deployment

UAC is implemented as a Low Interaction Honeyclient and has been integrated into the Distributed Honeynet System (DHS). Currently, DHS nodes are operational at eight geographical locations across India. The distributed deployment is done through the implementation of UAC as a virtual machine in the DHS client node. The central analysis server performs load balancing and load distribution to the various nodes depending upon the URL list.

The actual analysis is performed at the client, and the results are mapped to a central analysis server on a regular basis. This significantly reduces the transmission overhead and consumes less bandwidth and memory. This design also minimizes the operating cost of the server.

5 Experimentations and Evaluations

5.1 Performance Measurement (Standalone Systems)

See Tables 4 and 5.

Table 4 Performance measurement of UAC
Table 5 UAC system measurements

5.2 Performance Measurements (Distributed Systems)

See Table 6.

Table 6 UAC aspects for distributed deployments

5.3 Evaluations with Respect to Other Low Interaction Honeyclient

UAC has been evaluated against other open-source Low Interaction Honeyclients with respect to feature set and analysis capabilities. Table 7 presents the comparison results and depicts the effectiveness of UAC in detecting a large number of malicious URLs.

Table 7 Comparison of UAC with other Low Interaction Honeyclients

5.4 Experimental Evaluations

Lists of potentially malicious sites were derived from various sources, including CERT-In. These sites were analyzed by UAC and the results have been shared with the incident response group, which also aids in the validation of UAC results. The following statistics have been generated from these experiments (Table 8).

Table 8 Experimental Evaluation of UAC

5.5 Multi-threading Approach

A multi-threaded implementation permits still faster execution of UAC by exploiting the parallelism inherent in the analysis workflow. Table 9 provides an overview of the various stages in UAC that are candidates for multi-threaded execution.

Table 9 Multi-threading process in UAC

The performance improvement using multiple threads is directly visible from the following performance measurements:

 

                      Latency (s)   Throughput (URLs/h)
  With threading          12               300
  Without threading       20               180
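
As a minimal illustration of how such URL-level parallelism might be expressed, the sketch below fans out the per-URL analysis over a thread pool; analyze_url stands in for the full UAC pipeline (crawling, DOM parsing, JavaScript analysis, signature scanning) and the worker count is an assumption, not part of UAC's actual code.

```python
# Illustrative sketch of multi-threaded URL analysis using a thread pool.
from concurrent.futures import ThreadPoolExecutor


def analyze_all(urls, analyze_url, workers=8):
    """Runs analyze_url over all URLs in parallel and returns url -> verdict."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(analyze_url, urls)))
```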

6 Towards Signature Formulation

Anti-virus scanners detect attacks based on their signature databases. With the ever-growing diversification of attack code, it becomes useful and desirable to generate signatures for unknown attacks. The main goal of our approach is to update the signature database of the open-source community anti-virus, ClamAV.

All the JavaScripts that are declared malicious by UAC are further validated by submission to the VirusTotal portal to determine whether popular anti-virus scanners also label them as malicious. The automated signature-generation mechanism filters out all the scripts which are labeled as malicious by popular antivirus engines but not by ClamAV. Subsequently, hexadecimal and hash-based signatures are generated for the filtered JavaScripts and eventually populated into ClamAV to enhance its signature repository. This activity is a continual process that permits the regular enrichment of the open-source signature repository.
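
As an illustration of the hash-based part of this process, the sketch below emits ClamAV .hdb entries (format "MD5:filesize:MalwareName") for the filtered scripts; the file paths and the UAC.JS.<n> naming convention are hypothetical, and the hexadecimal (body-based) signatures are not shown.

```python
# Illustrative sketch of producing ClamAV hash-based (.hdb) signatures.
import hashlib


def hdb_entry(path, name):
    """Builds one ClamAV .hdb line ("MD5:filesize:MalwareName") for a file."""
    with open(path, "rb") as fh:
        data = fh.read()
    return f"{hashlib.md5(data).hexdigest()}:{len(data)}:{name}"


def write_hdb(script_paths, out_file="uac_js.hdb"):
    """Writes one entry per filtered malicious script."""
    with open(out_file, "w") as out:
        for i, path in enumerate(script_paths, start=1):
            out.write(hdb_entry(path, f"UAC.JS.{i}") + "\n")
```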

7 Conclusion and Future Work

UAC is a novel approach towards distributed and scalable analysis of URLs that leverages both dynamic execution (through emulation) and static analysis. UAC inspects the webpage from various perspectives, including suspicious-DOM parsing and JavaScript analysis, and attempts to cover the maximum analysis domain. Other popular dynamic client-side scripts such as JScript are easily accommodated in our analysis because they are based on the ECMA standards [32] and SpiderMonkey interprets ECMA scripts. We have also manually analyzed URLs declared benign by UAC to identify the reasons for failures and found that, for most such sites, the infection had already been removed by the time the site was analyzed by UAC. However, other analysis processes, such as file analyzers for SWF, PDF, and similar content, can be integrated for further inspection of the complete downloaded web-code. Also, on some websites we came across malware injected in the form of VB scripts, which is currently outside our scope.

Distributed crawling is an area we can pursue further, making use of facilities such as grid computing to perform large-scale analysis. The whole application can also be ported to a high-performance computing infrastructure to optimize the speed and level of performance for distributed computing.