3.1 Introduction

The modern world is rapidly moving toward digitalization and automation. As technology permeates daily life, we are at the start of the fourth industrial revolution, which relies heavily on technology and innovation. Technology provides both convenience and comfort. However, its rapid development comes at a price: cybersecurity must be ensured. Attackers keep finding new ways to achieve their malicious goals, which requires precautions against such security issues. One of the most common forms of security invasion in the digital world is malicious code, often referred to as malware [27]. Malware is code written by attackers to intrude into a specific computer system or software and perform malicious acts such as stealing data or causing damage. Malware takes different forms, such as worms, viruses, trojans, spyware, adware, and ransomware. Therefore, it is essential to protect systems from malware, which can be done by detecting the malware and then classifying its type. A tremendous amount of research has been conducted in recent years on malware detection and classification [11].

According to recent reports, malware creation has been increasing rapidly. It is estimated that around one million malware files are created daily [31]. This increase could seriously threaten the economy, both financially and technically. The increase in cyber threats and crimes cost the economy around 1 trillion dollars in 2022 for cyber insurance, an increase of 50% compared with the previous 2 years [12]. The term malware refers to any malicious entity that changes the original behavior of a system by exploiting software flaws and vulnerabilities. In this chapter, the term malware refers to any malicious software, including the following malware families: ransomware, adware, viruses, and keyloggers [11].

Malware is categorized into different families depending on its purpose and behavior, and the members of each family share common features. For instance, stealing information, creating vulnerabilities, and denial of service are all examples of malware behavior. Such behaviors are essential in detecting malware, since this information is used to analyze the software and categorize it as benign or malicious [35]. To differentiate between malicious and benign apps, the program code must first be scanned, and its features extracted and analyzed [6]. Feature extraction can be achieved in two main ways: static analysis [3] and dynamic analysis [13]. A third option is hybrid analysis [2], a combination of the previous two [25]. Static analysis is concerned with contextual data from the source code and does not run the program. In contrast, dynamic analysis involves executing the program and extracting runtime features. Hybrid analysis uses both contextual and runtime features to detect malware [11].

Over the years, researchers have been developing new techniques for malware detection. The latest trend in this field is using machine learning. However, this technique cannot be used without analyzing the program code and extracting important features that help in discriminating between malware families [22]. The risk of malware can be reduced if the related features are available. Therefore, many advanced detection methods based on machine learning depend on feature engineering as well as reverse engineering [33]. Feature engineering is a technique used to transform unstructured data into features that can be understood by a machine [32]. However, attackers can use techniques such as binary obfuscation to design files that resist reverse engineering [30]. Moreover, deep learning models, which are advanced neural networks, can capture features, learn, and adapt during training. Although some studies report the use of deep learning, several of them do not sufficiently discuss scalability or the different architectures available for malware detection [5, 33].

One of the main benefits of static analysis over other techniques is that it does not require executing the program, which makes it a safer choice [25]. Another vital benefit is that the code is examined without regard to the diversity of IoT architectures or the physical capabilities of an IoT device. Hence, the analysis considers all possible inspection methods with no reference to physical performance [24]. Furthermore, because static analysis runs passively, the malware may not be able to evade, hide from, or obfuscate itself during the analysis process [34]. Finally, static analysis lends itself to automation, which makes it a prominent choice [16].

Therefore, this chapter introduces a new comprehensive static parsing software called ASParseV3, an extension of ASParseV1 [1]. It is a GUI-based tool with various features, such as (a) selecting many files or directories to be scanned in one experiment, (b) adding or removing keywords/features, (c) filtering the keywords/features and specific file types, (d) scanning efficiently by processing many files simultaneously, (e) providing customizable visualization dashboards with the ability to export the chart(s), and (f) exporting the results in different formats such as JSON and CSV.

The rest of the chapter is organized as follows. First, the related works on malware analysis techniques, malware detection, and the use of static analysis for malware detection are presented and discussed. Next, the proposed software (ASParseV3), which performs static feature extraction and parsing, is described. The chapter then demonstrates a use case of static feature extraction from Android OS malware using ASParseV3. Finally, conclusions and a summary of possible future works are presented.

3.2 Related Works

Parsing the features of source code can be utilized in estimating software performance, reverse engineering, and static analysis [20]. The extracted features can be represented in different formats, such as gray-scale images, structural entropy, or JSON files [15]. Moreover, the extracted features can be further deployed in various fields. For instance, the authors of [21] developed a tool named DeepTLS to analyze encrypted traffic by extracting features from network packets. In [28], the python-Evtx-parser (pexp) was developed to parse the features required to detect Lateral Movement Attacks. Table 3.1 presents a comparison among related works.

Table 3.1 A comparison among existing parsing tools

Several tools have been proposed to perform static parsing on the Android platform [1, 8, 23]. Khalid et al. proposed a memory parsing tool for Android applications [19]. The authors of [17] developed Sena TLS-Parser, a tool that automates the software testing process by parsing the Android source code. Initially, the Android source code is imported into the Eclipse environment. Subsequently, Sena TLS-Parser scans the code and generates the required test cases. Another approach uses static parsing to support the development of Android applications by recommending a suitable API to the developer based on the parsing results. In [36], the authors developed APIMatchmaker, a tool that recommends the best API usage by parsing similar Android apps.

Parsing Android source code can further be deployed in detecting malicious applications. In [26], the authors parsed the suspect methods of two Android apps in order to extract their similarities using their proposed tool, StrAndroid. Consequently, they identified the potential malicious behaviors that are shared between the two apps. Additionally, Android permissions can be parsed in order to rank the risk of a malicious application. Dharmalingam et al. proposed a permission grading scheme that extracts and defines the required permissions in an Android app and rates the risk of the app accordingly [14]. In their proposed scheme, the Manifest file is parsed to extract the permissions defined in the app. Subsequently, the extracted permissions are fed into the feature encoder to be further utilized in the deep neural algorithm for detecting malware applications. Static analysis can also be combined with dynamic analysis to increase the efficiency of malware detection. In [2], the authors applied static analysis as a prior stage to dynamic analysis.
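The Manifest-based permission parsing described above can be illustrated with a short sketch. The following Python example is an assumption made for illustration and is not the scheme of [14]: it reads a decoded (plain-text) AndroidManifest.xml and collects the names declared in its `uses-permission` elements. The function name `extract_permissions` and the sample manifest are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by Android for attributes such as android:name.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Hypothetical decoded manifest, used only for illustration.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.RECEIVE_BOOT_COMPLETED"/>
    <application android:label="Example"/>
</manifest>"""


def extract_permissions(manifest_xml: str) -> list[str]:
    """Return the permission names declared in <uses-permission> elements."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"  # ElementTree spells android:name this way
    return [
        element.get(name_attr)
        for element in root.iter("uses-permission")
        if element.get(name_attr)
    ]


if __name__ == "__main__":
    for permission in extract_permissions(SAMPLE_MANIFEST):
        print(permission)
```

The sketch assumes the manifest has already been decoded to text, for example by a decompilation tool, since the manifest inside an APK is stored in a binary XML format.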

The efficiency of the parsing approach highly affects the overall static analysis process. The authors of [18] applied canonical representation to enhance the parsing process for Android code by developing the static analysis tool PetaDroid. The core of this solution is to define the application's behavior by tracking the used APIs and the app's actions, and consequently to fingerprint malware applications. Besides the API calls, the permissions can be utilized to determine the malicious application's behavior. In [29], the APK file was decomposed using APKtool to retrieve the Manifest file and the classes.dex file. These files were parsed to extract the permissions and the API calls, respectively. Then, multidimensional behavior analysis was conducted on the extracted features to develop a malware portrait. Even though there are many static parsing tools, they are not flexible in accepting many file systems and can extract only a limited number of features. Moreover, they do not have a customizable graphical user interface (GUI). Therefore, there is a need for a customizable GUI-based system with the ability to scan an unlimited number of features on various file systems.

3.3 Proposed System

There is a need for user-friendly, extensible, and flexible software. This chapter introduces the third version of the Android Static Parse (ASParse) tool. ASParseV3 is an improvement on the previous versions. It is a cross-platform, portable, and general tool that performs static analysis and feature parsing for any file type while supporting different operating systems. This version of ASParse is efficient and fast because it scans files concurrently. Furthermore, because it can export results to JSON and CSV files, ASParseV3 can be used as a preprocessing step for static feature extraction to construct datasets for subsequent processing through ML/DL models. The previous versions of the ASParse tool were used in this way to extract static features and develop different types of datasets [1]. For example, [4, 7] utilized the ASParse tool to extract the APIs and permissions of thousands of Android applications. The extracted features formed a dataset that helped detect ransomware apps with high accuracy.
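To make the idea of concurrent scanning more concrete, the following Python sketch counts keyword occurrences in several files in parallel and merges the per-file counts. It is an illustrative assumption and not the ASParseV3 implementation: the names `count_in_file` and `scan_concurrently`, the use of a thread pool, and the plain case-sensitive substring matching are all hypothetical choices made for this example.

```python
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_in_file(path: Path, keywords: list[str]) -> Counter:
    """Count case-sensitive, non-overlapping occurrences of each keyword in one file."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    return Counter({keyword: text.count(keyword) for keyword in keywords})


def scan_concurrently(paths: list[Path], keywords: list[str], max_workers: int = 8) -> Counter:
    """Scan many files at once and sum the per-file keyword counts."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs count_in_file on several files at the same time.
        for partial in pool.map(lambda p: count_in_file(p, keywords), paths):
            total.update(partial)  # Counter.update adds counts together
    return total


if __name__ == "__main__":
    # Two small hypothetical files stand in for decompiled application sources.
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "a.smali"
        second = Path(tmp) / "b.smali"
        first.write_text("Bundle Button Bundle", encoding="utf-8")
        second.write_text("Bundle Callback", encoding="utf-8")
        print(scan_concurrently([first, second], ["Bundle", "Button", "Callback"]))
```

Threads are a reasonable fit here because reading files is mostly I/O-bound; a process pool would be an alternative if the matching itself dominated the running time.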

3.3.1 System Overview

To illustrate the system flow, Fig. 3.1 shows how the ASParseV3 application generally works. The first step is uploading the files, directories, or multiple directories. The second step is choosing a set of predefined features or adding specific features. Then, moving to the third step, the system scans the files to export the results. Finally, after the results are exported, they can be visualized via a customizable dashboard.

Fig. 3.1
A flow diagram represents the following sequence of actions. Upload files, directories, or applications, select file types, select keywords, scan files, export results, and visualize results. A set of sub-actions are denoted under selecting file types, keywords, and exporting results.

Flow structure of the proposed system

3.3.2 Features and User Interfaces

The scalability and portability of ASParseV3 are achieved by building it on a portable development environment, which also makes the software cross-platform so that it can be installed on various operating systems (OSs). In addition, the software can serve both general and specific purposes. For example, it can scan and parse different input formats, such as Android and Windows applications. Furthermore, ASParseV3 is user-friendly due to its modern, easy-to-use graphical user interface (GUI) and its customizability based on the user's needs. For instance, the user can customize the features and file types to be scanned and customize the scanning results using the filtering feature available on the results dashboard. The system process is divided into five main steps: uploading files, selecting file types, choosing keywords, scanning, and results visualization. Each step has a separate user-friendly window.

3.3.2.1 Uploading Files Window

The first window of the application is used to upload files or applications to be scanned. The user can upload multiple files, directories, or a single directory. As Fig. 3.2 illustrates, the button “Add” is clicked to upload the applications, which opens a file selector dialog window to upload files/directories. All uploaded files will be shown on a panel field. The user may also clear the uploaded files in the panel field by clicking on the “Clear” button and adding new applications when needed.

Fig. 3.2
A screenshot of the ASParse version 3 application window represents a list of application paths. It highlights the buttons labeled upload files, add, and clear.

Uploading applications window

3.3.2.2 Selecting File Types Window

The second window allows users to select the file types (file extensions) to be scanned. Figure 3.3a shows a sample of Android OS file types. The user may choose one or multiple types by checking the corresponding checkboxes. Moreover, the user can customize the file types by adding or deleting types through the settings icon on the top right of Fig. 3.3a. The settings button opens a new window for editing, as Fig. 3.3b illustrates. The user can write the file types in the text field and then click on the "Add" button to add them to the current panel. The user can also delete any newly defined types by clicking on the "Remove" button. By default, if no checkboxes are checked, all predefined file types are included in the scanning process.

Fig. 3.3
A set of 2 screenshots of the ASParse application. a, It depicts the selecting window highlighting the selected file types of xml and smali. b, It represents the customizing window, where the txt and json file types are put in the text box for adding.

Selecting and customizing file types windows. (a) Selecting Window. (b) Customizing Window
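The file-type selection rule described in this subsection, including the default of scanning all predefined types when nothing is checked, can be sketched as follows. This Python example is a hypothetical illustration rather than the tool's code; the function name `select_files` and its parameters are assumptions.

```python
from pathlib import Path


def select_files(paths: list[str], selected_types: list[str], predefined_types: list[str]) -> list[str]:
    """Keep only the paths whose extension is among the active file types.

    If no type is selected, every predefined type is treated as active,
    mirroring the default behavior described in the text.
    """
    active = selected_types or predefined_types
    # Normalize "xml", ".xml", and "XML" to the same form: ".xml".
    suffixes = {"." + file_type.lower().lstrip(".") for file_type in active}
    return [path for path in paths if Path(path).suffix.lower() in suffixes]


if __name__ == "__main__":
    files = ["AndroidManifest.xml", "Main.smali", "icon.png"]
    print(select_files(files, ["xml"], ["xml", "smali"]))  # only the selected type
    print(select_files(files, [], ["xml", "smali"]))       # default: all predefined types
```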

3.3.2.3 Selecting Keywords Window

The third window allows users to select the keywords to look for while scanning. Figure 3.4a shows a sample of Android OS keywords. The user can customize the features by adding or deleting keywords through the settings icon on the top right of the window (as shown in Fig. 3.4a). Similar to the file types editing feature, the settings button can be used to edit the list of keywords, as illustrated in Fig. 3.4b.

Fig. 3.4
A set of 2 screenshots of the ASParse application. a, It represents the selecting window highlighting the selected keyword, named android. b, It represents the customizing window. The keyword named callback is added at the top.

Selecting and customizing keywords windows. (a) Selecting Window. (b) Customizing Window
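As a rough illustration of what scanning for selected keywords can mean, the sketch below counts how often each keyword appears in a piece of text. The chapter does not specify the matching rule used by ASParseV3, so the case-sensitive substring matching shown here is an assumption, and the function name `count_keywords` and the sample text are hypothetical.

```python
def count_keywords(text: str, keywords: list[str]) -> dict[str, int]:
    """Count case-sensitive, non-overlapping substring occurrences of each keyword."""
    return {keyword: text.count(keyword) for keyword in keywords}


# Hypothetical smali-like type references, used only for illustration.
SAMPLE_TEXT = "\n".join(
    [
        "Landroid/app/Activity;",
        "Landroid/os/Bundle;",
        "Landroid/animation/Animator;",
    ]
)

if __name__ == "__main__":
    selected = ["android", "android/animation", "android/app", "Bundle"]
    for keyword, occurrences in count_keywords(SAMPLE_TEXT, selected).items():
        print(keyword, occurrences)
```

Under substring matching, a broad keyword such as android is also counted inside longer keywords such as android/app, so broad keywords naturally receive the highest counts.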

3.3.2.4 Scanning Window

The fourth window allows users to add the configuration values of an experiment, such as the experiment name and the path used to save the results, as shown in Fig. 3.5. Then, the scanning process begins by clicking on the “Scan” button. Finally, the progress bar provides the user with real-time updates on the scanning progress.

Fig. 3.5
A screenshot of the ASParse application represents a text box to enter the experiment name and an option to browse the output path. It highlights the scan button on the left panel and the previous, scan, and next buttons at the bottom.

Scanning window

3.3.2.5 Visualizing Results and Dashboard Window

The fifth and final window links the tool to the visualization dashboard. After the scanning process is complete, the user can move to the visualization window and click on the "Visualize" button, as shown in Fig. 3.6a, to display the results as a plot. The actions performed in this window do not affect the scanning results. It is a complementary step for results visualization and filtering. However, this step cannot be completed without performing the scanning. When visualization is activated, a dashboard page opens in the browser. The dashboard is where the user can visualize the parsing results. The plot's X-axis represents the features (keywords), and the Y-axis represents the number of occurrences. As Fig. 3.6b illustrates, the dashboard is customizable based on the user's preference. For instance, the user may filter the results according to a minimum number of feature occurrences or to features containing a specific string or substring. Also, the resulting plot can be exported as an image using the save button on the right of the plot. This helps researchers and experts share their results conveniently.

Fig. 3.6
Two screenshots. a, It represents the visualization window highlighting the visualize result option on the left panel and the button to visualize in the center. b, It represents the dashboard page denoting a graph of the number of occurrences versus features and a set of settings on the left.

Visualization window and page. (a) Visualization Window. (b) Dashboard Page
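The two dashboard filters mentioned in this subsection, a minimum number of occurrences and a substring of the feature name, can be expressed in a few lines. The following Python sketch is an illustrative assumption rather than the dashboard's code; the function name `filter_results` and the example counts are hypothetical.

```python
def filter_results(counts: dict[str, int], min_occurrences: int = 0, contains: str = "") -> dict[str, int]:
    """Keep features that occur at least min_occurrences times and contain the given substring."""
    return {
        feature: occurrences
        for feature, occurrences in counts.items()
        if occurrences >= min_occurrences and contains in feature
    }


if __name__ == "__main__":
    # Hypothetical scan output, not results reported in the chapter.
    example_counts = {"android": 120, "android/app": 40, "Bundle": 15, "Callback": 2}
    print(filter_results(example_counts, min_occurrences=10, contains="android"))
```

Because the filter returns a new dictionary and leaves its input untouched, it is consistent with the statement that actions in the visualization step do not affect the scanning results.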

3.3.3 Use Case

To demonstrate the tool, benign and malicious Android samples were used. The samples come in the form of Android Package Kit (APK) files. An APK contains all software details, including the source code, permissions, and APIs used. However, APKs are compressed files that need reverse engineering to recover the application code [9]. APKTool was used to decompile the apps and extract the source files. Afterward, the decompiled APKs were fed to the ASParse tool.
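The decompilation step can be scripted when many APKs must be processed. The following Python sketch is an assumption about how such a step might be automated and is not part of ASParseV3: it builds and runs the Apktool decode command (`apktool d <apk> -o <dir> -f`) and presumes that an `apktool` executable is available on the PATH. The function names and file paths are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_apktool_command(apk_path: str, out_dir: str) -> list[str]:
    """Build the Apktool decode command: d = decode, -o = output directory, -f = overwrite."""
    return ["apktool", "d", str(apk_path), "-o", str(out_dir), "-f"]


def decompile(apk_path: str, out_dir: str) -> Path:
    """Decode one APK into out_dir; raises if apktool is missing or the command fails."""
    if shutil.which("apktool") is None:
        raise FileNotFoundError("apktool was not found on PATH")
    subprocess.run(build_apktool_command(apk_path, out_dir), check=True)
    return Path(out_dir)


if __name__ == "__main__":
    # Hypothetical paths; replace them with a real APK and output directory.
    print(" ".join(build_apktool_command("sample.apk", "decoded/sample")))
```

The resulting output directory, containing the decoded manifest and smali files, is what would then be uploaded to the parsing tool.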

3.3.3.1 Data Collection

For data collection, two sources were used: the Drebin dataset [10] and APKCombo. The Drebin dataset contains 5560 malware samples belonging to 179 malware families. The benign samples, on the other hand, were downloaded through APKCombo. Ten samples were randomly chosen from the Drebin dataset, along with ten samples from APKCombo. To ensure that the apps downloaded from APKCombo are benign, they were scanned using VirusTotal, a well-known website that offers tens of antivirus engines specialized in detecting different types of malware.

3.3.3.2 Tests and Results

The experiment was performed on a sample of 10 benign APKs and 10 malicious samples from the Drebin dataset. First, all files were added to the application upload field. Then, all predefined file types were chosen. Afterward, six keywords were chosen: the predefined keywords android, android/animation, and android/app, together with Bundle, Button, and Callback. After the visualization button in the final window is clicked, the application shifts to the dashboard, where the plot is displayed and can be saved after customization. Figure 3.7 illustrates a sample of the saved plot. Moreover, Fig. 3.8 shows a sample of the saved plot with the details of individual data points displayed. Furthermore, Table 3.2 demonstrates a sample of the resulting CSV. Finally, Fig. 3.9 represents the JSON metadata file resulting from the scan.

Fig. 3.7
A dot plot represents the data of the number of occurrences for a set of features, which include Android, Android animation, Android app, Bundle, button, and callback. Android denotes the highest number of occurrences.

Features vs. Occurrences Plot

Fig. 3.8
A dot plot represents the data of the number of occurrences for a set of features which include Android, Android animation, Android app, Bundle, button, and callback. It highlights the app names of Malware sample 4 and Benign sample 1 for the plots of Android.

Data point details

Fig. 3.9
A snippet of code denotes a list of application paths, the output path, file types, selected file types, keywords, selected keywords, and experiment name.

Metadata JSON content for the use case

Table 3.2 The resulting CSV from the use case
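To give a sense of how the CSV results and the JSON metadata of an experiment might be written, the following Python sketch exports both files. It is a hypothetical illustration: the CSV column names, the file names, and the metadata key names are assumptions, chosen to mirror the fields listed for Fig. 3.9 (application paths, output path, file types, selected file types, keywords, selected keywords, and experiment name), and the rows are made-up values rather than results from the use case.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_results(rows: list[dict], metadata: dict, out_dir: str) -> tuple[Path, Path]:
    """Write scan rows to results.csv and experiment settings to metadata.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    csv_path = out / "results.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["app", "feature", "occurrences"])
        writer.writeheader()
        writer.writerows(rows)

    json_path = out / "metadata.json"
    json_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return csv_path, json_path


if __name__ == "__main__":
    # Made-up rows and hypothetical metadata keys, for illustration only.
    example_rows = [
        {"app": "benign_sample_1", "feature": "android", "occurrences": 120},
        {"app": "malware_sample_4", "feature": "Bundle", "occurrences": 15},
    ]
    example_metadata = {
        "experiment_name": "use_case_demo",
        "application_paths": ["decoded/benign_sample_1", "decoded/malware_sample_4"],
        "output_path": "results",
        "file_types": ["xml", "smali"],
        "selected_file_types": ["xml", "smali"],
        "keywords": ["android", "Bundle", "Button", "Callback"],
        "selected_keywords": ["android", "Bundle"],
    }
    with tempfile.TemporaryDirectory() as tmp:
        print(export_results(example_rows, example_metadata, tmp))
```

A long-format CSV with one row per application and feature, as assumed here, can be pivoted into a feature matrix for ML/DL models with standard data analysis libraries.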

3.3.3.3 Validation

The validation process for ASParseV3 was carried out thoroughly to ensure that its performance, user interface (UI), and user experience (UX) met the required needs. The Security Engineering Lab (SEL) conducted the validation and compared the scanning results of ASParseV3 with previous releases of ASParse. In addition, VirusTotal was used to retrieve information such as the permissions used in the applications/APKs, to compare with ASParseV3 and further verify the accuracy of its scanning results. To validate the use case, VirusTotal was used to collect the permissions used by the APK. Figure 3.10 shows a sample of the permissions used by the APK validation test sample. The resulting permissions were then used as keywords to scan the same APK using ASParseV3. The results showed that ASParseV3 could scan the uploaded APK and accurately report the number of occurrences of each permission. For example, Table 3.3 illustrates the number of occurrences of each permission found by ASParseV3 during the validation process. Moreover, using ASParseV3 to scan the same application without specifying any keywords revealed additional permissions/API calls beyond the ones retrieved from VirusTotal, as Table 3.4 illustrates. Overall, the comparison with previous releases and with VirusTotal indicates that ASParseV3 is a reliable and efficient tool for scanning application and APK features such as permissions, and that it offers additional capabilities compared with similar tools.

Fig. 3.10
A list of texts represents the APK permissions. The permissions include receive boot completed, access Wi-Fi state, AD ID, bind get install referrer service, and billing.

APK permissions from VirusTotal

Table 3.3 Validation results
Table 3.4 ASParseV3 additional permissions and calls
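The comparison logic behind this validation can be summarized in a small sketch: given a reference list of permissions (for instance, one collected from an external service) and the occurrence counts produced by a scan, it reports which reference permissions were found, which were missed, and which additional features the scan reported. The Python example below is a hypothetical illustration, not the procedure used by the authors; the function name `compare_with_reference` and all counts are assumptions.

```python
def compare_with_reference(reference: list[str], scan_counts: dict[str, int]) -> dict:
    """Compare scan counts against a reference permission list."""
    reference_set = set(reference)
    found = {name: scan_counts[name] for name in reference if scan_counts.get(name, 0) > 0}
    missing = sorted(name for name in reference_set if scan_counts.get(name, 0) == 0)
    extra = sorted(
        name for name, count in scan_counts.items() if count > 0 and name not in reference_set
    )
    return {"found": found, "missing": missing, "extra": extra}


if __name__ == "__main__":
    # Reference permissions and made-up counts, for illustration only.
    reference_permissions = [
        "android.permission.RECEIVE_BOOT_COMPLETED",
        "android.permission.ACCESS_WIFI_STATE",
    ]
    example_counts = {
        "android.permission.RECEIVE_BOOT_COMPLETED": 2,
        "android.permission.ACCESS_WIFI_STATE": 1,
        "android.permission.INTERNET": 4,
    }
    print(compare_with_reference(reference_permissions, example_counts))
```

An empty `missing` list corresponds to the case where every reference permission was located by the scan, while a non-empty `extra` list corresponds to the additional permissions and calls of the kind reported in Table 3.4.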

3.4 Conclusion and Future Work

This chapter proposed a third version of ASParse software as a parsing and static analysis tool. The analysis results can be used to feed machine learning algorithms and deep learning models for malware analysis and detection. Moreover, a demonstration was presented on Android OS applications showing the system’s capabilities. In future work, the ASParse tool will be used to carry on with malware detection using ML and DL algorithms and models. Moreover, it will be enhanced in terms of performance and user experience.