1 Introduction

Smart terminals have become essential personal equipment. More and more customers use smartphones to access a wealth of information and to contact each other. On the one hand, the openness of Android markets plays an important role in the popularity of Android applications (apps). On the other hand, Android apps have become the target of many attackers. According to the 2016 China Internet Security Report (China Internet Security 2016), 360 Internet Security Center intercepted a total of 14.03 million new malware samples on the Android platform. Ransomware also started to break out on mobile phones: throughout the year, the 360 corporation intercepted 0.17 million new ransomware samples on mobile phones, which attacked 1.70 million mobile phones (China Internet Security 2016). The report projected that the number of infections on the Android platform would grow tenfold in 2017. Processing big data in security has therefore become increasingly important (Hamedani et al. 2018; Wu et al. 2016a, b; Atat et al. 2017), and research on modern cryptographic solutions for computer and cyber security is becoming increasingly popular (Ibtihal et al. 2017; Gupta et al. 2016). With the increasing threat of Android malware, it is urgent to develop effective malware detection methods that help to keep these threats away from individuals and the markets.

In recent years, machine learning models have been widely employed in malware detection. These models can learn the distinctions between malicious and benign apps. Deep learning is a relatively new area in machine learning research. In deep learning methods, the lower layers extract fine-grained features, while the higher layers represent more abstract features. As one of the major models in deep learning, the convolutional neural network (CNN) has been widely used for image recognition (Li et al. 2018b) and has shown promising performance in contextual categorization. In this work, we are motivated to detect malicious Android apps with two different CNN structures. We also propose a new pre-training strategy based on a deep autoencoder (DAE) to learn more suitable feature representations. The proposed approach improves both the efficiency and the accuracy of Android malware detection compared with the basic CNN and traditional machine learning methods.

We make the following contributions:

  • We use CNN with different structures to improve the accuracy of Android malware detection. The experimental results show that CNN-S reaches 99.8% prediction accuracy, which is 5% higher than SVM.

  • We develop two different CNN architectures, CNN-S and CNN-P, and the experimental results differ between them. We finally adopt CNN-S: the experimental results show that the detection accuracy with CNN-S is higher than that with CNN-P (99.80–99.82%) and is improved by 3% compared with the basic CNN.

  • We propose using DAE as a pre-training method for CNN-S to reduce the training time. We add sparse rules when pre-training the models and use the non-linear function Relu as the activation function, which can efficiently extract abstract features. Extensive experimental results demonstrate that DAE-CNN reduces the training time by 83% compared with CNN-S.

The rest of this paper is organized as follows. Section 2 introduces related work on Android malware detection and deep learning methods. Section 3 describes the architecture and theoretical background of DAE, CNN and DAE-CNN. Section 4 shows how CNN can be used to detect Android malware and presents the experimental results. The conclusions follow in Sect. 5.

2 Related work

Existing work on the detection of malicious apps mainly focuses on the analysis of static features (Rastogi et al. 2016; Sarma et al. 2012; Pandita et al. 2013; Lu et al. 2012; Klieber et al. 2014) or dynamic features (Enck et al. 2014; Wu and Hung 2014; Amos et al. 2013). Li and Li (2015) proposed an Android malware detection method based on characteristic trees. Yerima et al. (2014) developed machine learning approaches based on Bayesian classification to uncover unknown Android malware. Zhou and Jiang (2012) proposed a permission-based scheme to detect new samples of known Android malware families and applied a heuristics-based filtering scheme to identify certain inherent behaviors of unknown malicious families. Shabtai et al. (2010) studied static analysis techniques for analyzing Android source code. They also applied machine learning techniques to categorize games and tools with static features extracted from Android apps. There is also work on securing smartphone authentication (Shen et al. 2018a, b, c, d), cloud authentication (Li et al. 2018a, b; Wang et al. 2018a, b, c, d; Shen et al. 2018a, b, c, d; Xie et al. 2018; Chen et al. 2015) and user authentication (Shen et al. 2017). In our previous work, we employed an ensemble of multiple classifiers, namely Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Naive Bayes (NB), Classification and Regression Tree (CART) and Random Forest (RF), to detect malicious apps and to categorize benign apps (Wang et al. 2018a, b, c, d). We also systematically explored the permission-induced risk in Android apps on three levels (Wang et al. 2014a, b). In addition, we gave insights into which discriminatory features are most effective for characterizing malicious apps, with the aim of building an effective and efficient malicious app detection system (Wang et al. 2017). We developed a tool called “SDFDroid” to identify the types of sensors used and to generate the sensor data propagation graphs in each app (Liu et al. 2018). Feature selection is one of the significant steps in classification (Zhang et al. 2017; Lee et al. 2017; Memos et al. 2018; Wang et al. 2015) and intrusion detection (Wang et al. 2008, 2014a, b, 2018a, b, c, d; Guan et al. 2009; Wang and Battiti 2006). Machine learning methods can automatically analyze malicious behavior and detect malware effectively.

Deep learning is a new area of machine learning and has been widely applied in many scenarios (Hamedani et al. 2018), such as event detection and analysis (Chang et al. 2017a, b; Chang and Yang 2017; Li et al. 2017a, b). It can also perform well in Android malware detection. Yuan et al. (2016) proposed to combine features from static analysis with features from dynamic analysis of Android apps and to characterize malware using a Deep Belief Network (DBN) (Bengio 2009; Hinton et al. 2014). CNN, proposed by Lecun et al. (1998), has shown strong performance in recognition and classification. The latest work applying CNN to Android malware detection appeared in 2017 (Huang et al. 2017; Nix and Zhang 2017). Different from existing work, we focus on evaluating different CNN structures and on reducing the training time without changing the computing environment. To analyze the effect of different structures on detection accuracy, two different CNN structures and one basic CNN structure are employed in the same experimental environment. We also evaluate the influence of different activation functions. Because of the large number of parameters, we use the “dropout” technique (Hinton et al. 2012) to prevent complex co-adaptations of weights and to reduce overfitting. In addition, we use pre-training for the detection: we propose a hybrid model that combines DAE and CNN to reduce the training time.

3 Methods

3.1 Architecture of deep autoencoder (DAE)

A typical autoencoder (Hinton and Zemel 1994) is an unsupervised model that learns to reconstruct its input. Deep learning models are capable of learning complex hierarchical nonlinear features, which are considered better representations of the original data in many fields such as speech recognition and computer vision (Zhang et al. 2017). The basic structure of an autoencoder contains an encoding layer, a hidden layer and a decoding layer. The input of the hidden layer is the output of the encoding layer, and the input of the decoding layer is the output of the hidden layer. The autoencoder function is the composition of the encoder and the decoder, which have a symmetrical architecture, and maps \({R^d} \to {R^d}\). The encoding layer, parameterized by \(\theta =\{ W,b\}\), with input \({\varvec{x}}\) and output \({\varvec{y}}\), is:

$${\varvec{y}}=~{f_\theta }\left( x \right)=~\sigma (W \times {\varvec{x}}+b)$$
(1)

where σ(x) = max (0, x) is the activation function, Relu, of the encoding layer. Compared with Sigmoid, Relu can usually eliminate the need for pre-training and lets deep learning models converge more quickly, sometimes to more discriminative solutions, while keeping the model sparse (Lecun et al. 1998; Huang et al. 2017). The reconstruction function between the hidden layer and the decoding layer is:

$$z=~{f_{\theta ^\prime }}\left( y \right)=~\sigma (W^\prime \times y+b^\prime )$$
(2)

The main target of the autoencoder is to find the parameters that minimize the reconstruction loss:

$$\theta ,\theta ^\prime =\mathop {\arg \min }\limits_{\theta ,\theta ^\prime }\mathop \sum \limits_{{i=1}}^{d} L({x_i},{z_i})$$
(3)

so that the decoding layer can reconstruct the input from the hidden representation.
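
As an illustration of Eqs. (1)–(3), the following minimal sketch runs one forward pass of a single-hidden-layer autoencoder in plain Python. The toy sizes, the random initialization and the choice of squared error for the loss L are assumptions made for this example; the paper does not specify them, and the helper names are hypothetical.

```python
import random

def relu(v):
    """Element-wise Relu: sigma(x) = max(0, x)."""
    return [max(0.0, a) for a in v]

def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows), vector x and bias b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def encode(x, W, b):
    """Eq. (1): y = sigma(W x + b)."""
    return relu(affine(W, x, b))

def decode(y, W_prime, b_prime):
    """Eq. (2): z = sigma(W' y + b')."""
    return relu(affine(W_prime, y, b_prime))

def reconstruction_loss(x, z):
    """Objective of Eq. (3) for one input, assuming L(a, b) = (a - b)^2."""
    return sum((x_i - z_i) ** 2 for x_i, z_i in zip(x, z))

def random_matrix(rows, cols, rng):
    """Small random weights; the initialization scheme is an assumption."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, h = 6, 3                                  # toy input and hidden sizes
W, b = random_matrix(h, d, rng), [0.0] * h
W_prime, b_prime = random_matrix(d, h, rng), [0.0] * d

x = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]           # toy binary feature vector
y = encode(x, W, b)                          # hidden representation
z = decode(y, W_prime, b_prime)              # reconstruction of x
loss = reconstruction_loss(x, z)
```

Training would adjust `W`, `b`, `W_prime` and `b_prime` to lower `loss` over the whole training set, for example by gradient descent; that step is omitted here.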

In this work, we use a DAE model, which has more than one hidden layer, to extract features from the training data. In addition, we extend DAE to classify Android apps. There are four layers in the DAE model (as illustrated in Fig. 1): one encoding layer, two hidden layers and one classification layer. Using softmax as the activation function of the classification layer and training on labeled data, we achieve the goal of detecting Android malware.

Fig. 1 Deep autoencoder model

3.2 CNN with different architectures

The CNN-S model architecture, shown in Fig. 2, is a slight variant of basic CNN architecture.

Fig. 2 CNN-S model

We reconstruct the extracted features of each app as the input of the convolutional layer. Given \({{\varvec{x}}_{\varvec{i}}} \in {R^k}\) as the k-dimensional feature vector corresponding to the \(i\)-th feature in the feature codes of each Android app, the Android app of length n (padded where necessary) can be represented as:

$${{\varvec{x}}_{1:~n}}={{\varvec{x}}_1} \oplus {{\varvec{x}}_2} \oplus \ldots \oplus {{\varvec{x}}_{\varvec{n}}}$$
(4)

The input data are convolved with learnable filters (kernels). A filter is applied to a window of \(m\) features to produce a new feature. For example, the feature \({y_i}\) is generated from a window of features \({{\varvec{x}}_{i:~i+m - 1}}~\) filtered with \(W \in {R^{m*k}}\) by:

$${{\varvec{y}}_i}=f(W*{{\varvec{x}}_{i:i+m - 1}}+b)$$
(5)

where \(f\left( x \right)={\text{max~}}(0,~x)\) is the nonlinear activation function Relu and \(b \in R\) is a bias term. The filter \(W \in {R^{m*k}}\) is applied to each possible window of features in \(\{ {x_{1:~m}},~{x_{2:~m+1}},~ \ldots ,~{x_{n - m+1:~n}}\}\) to produce a feature map \({\varvec{y}} \in {R^{\left( {n - m+1} \right)*1}}\). The feature maps are fed to the max-pooling layer, which takes the maximum value \(Y={\text{max}}\{ {\varvec{y}}\}\) of each feature map, capturing the most important feature and reducing the feature dimension.
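
The convolution of Eq. (5) and the subsequent max pooling can be sketched as follows. This is a plain-Python illustration on a toy input with n = 5 features of dimension k = 2 and one filter with window m = 2; the values are made up, and the actual models in this paper are built with Keras.

```python
def relu(a):
    """Relu activation: f(x) = max(0, x)."""
    return max(0.0, a)

def conv_feature_map(x, W, b):
    """Slide filter W (m rows of k weights) over x (n rows of k values).

    Applies Eq. (5) to every window x[i:i+m], giving n - m + 1 outputs.
    """
    n, m = len(x), len(W)
    feature_map = []
    for i in range(n - m + 1):
        window = x[i:i + m]
        s = sum(w * v
                for w_row, x_row in zip(W, window)
                for w, v in zip(w_row, x_row))
        feature_map.append(relu(s + b))
    return feature_map

def max_over_map(feature_map):
    """Max pooling over the whole feature map: Y = max{y}."""
    return max(feature_map)

# Toy app: n = 5 features, each a k = 2 dimensional vector.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
W = [[1.0, -1.0], [0.5, 0.5]]   # one filter with window m = 2
b = 0.0

y = conv_feature_map(x, W, b)   # [1.5, 0.0, 0.0, 0.5]
Y = max_over_map(y)             # 1.5
```

The feature map has n − m + 1 = 4 entries, and pooling keeps only the strongest response of the filter.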

The CNN-S model consists of three convolutional layers, with a max-pooling layer after each of the first two. The activation function of each convolutional layer is Relu, which yields a small, non-zero gradient and increases the accuracy of the CNN. Different from the basic CNN, the output of the third convolutional layer is fed to the fully connected layer together with the output of the second max-pooling layer to maximize the extraction of features. In this layer, the neurons are fully connected to all activations in the preceding layers.

After that, we add a dropout layer to the fully connected layer to prevent co-adaptation of hidden neurons. The dropped-out neurons do not contribute to the forward pass, and only the remaining neurons participate in back-propagation. This layer therefore helps to reduce complex co-adaptation of neurons. Without dropout, the experimental results exhibit substantial overfitting.
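
A minimal sketch of the dropout forward pass is given below. The rescaling of the kept activations by 1/(1 − rate), often called inverted dropout, is an assumption of this example and is not stated in the paper; the function name and the toy activations are hypothetical.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training.

    Kept activations are scaled by 1 / (1 - rate) (inverted dropout), an
    assumption made here so that no rescaling is needed at test time.
    Returns the new activations and the 0/1 mask that was applied.
    """
    if not training or rate == 0.0:
        return list(activations), [1] * len(activations)
    mask = [0 if rng.random() < rate else 1 for _ in activations]
    scale = 1.0 / (1.0 - rate)
    return [a * m * scale for a, m in zip(activations, mask)], mask

rng = random.Random(42)
hidden = [0.2, 1.5, 0.7, 0.0, 2.1, 0.9]      # toy fully connected outputs
dropped, mask = dropout(hidden, rate=0.5, rng=rng)
unchanged, _ = dropout(hidden, rate=0.5, rng=rng, training=False)
```

The same `mask` would be applied to the gradients during back-propagation, so that dropped-out neurons receive no update in that step.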

At the end of the CNN-S model, a softmax layer performs the classification; its output is the probability distribution over labels.
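
For reference, the softmax function that turns the final scores into a probability distribution over the labels can be written as below. The two input scores are made up for illustration, and subtracting the maximum is a standard numerical-stability measure rather than something described in the paper.

```python
import math

def softmax(logits):
    """Map raw scores to a probability distribution over labels."""
    shift = max(logits)                      # subtract max for stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two labels, e.g. benign and malicious; the scores are made up.
probs = softmax([0.3, 2.2])
predicted = probs.index(max(probs))          # index of the most likely label
```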

The CNN-P architecture, illustrated in Fig. 3, is another variant of the basic CNN architecture. The CNN-P model uses filters with three different window sizes to extract multiple features. These features form the penultimate layer and are then fed to a fully connected layer. In both proposed models, the error between the actual output and the network output is computed and minimized by back-propagation, and the weights of the CNN are adjusted accordingly to fine-tune them.

Fig. 3 CNN-P model

Time complexity determines a model’s training and testing time. If the complexity is too high, training and testing take a great deal of time, and the model can neither be evaluated quickly nor make quick predictions. The factors that affect the time complexity of a convolutional layer include the sizes of the input feature maps, the sizes of the kernels, and the numbers of input and output channels. Given \(D\) as the depth of the CNN model, \(l\) as the index of a convolutional layer, \({M_l}\) as the size of its feature maps, \({K_l}\) as the size of its kernels, and \({C_{l - 1}}\) and \({C_l}\) as its numbers of input and output channels, the time complexity of a CNN model can be represented as:

$$Time\sim O\left(\mathop \sum \limits_{{l=1}}^{D} M_{l}^{2}*K_{l}^{2}*{C_{l - 1}}*{C_l}\right)$$
(6)

According to Eq. (6), the training time can be reduced by reducing the dimension of the input data.
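
To make the effect of Eq. (6) concrete, the sketch below evaluates the sum for two hypothetical two-layer networks that differ only in the size of their feature maps. The layer configurations are invented for illustration and do not correspond to the models evaluated in this paper.

```python
def conv_time_cost(layers):
    """Evaluate the sum inside Eq. (6).

    `layers` is a list of (M, K, C_in, C_out) tuples, one per convolutional
    layer: feature-map size, kernel size, input channels, output channels.
    """
    return sum(M ** 2 * K ** 2 * c_in * c_out for M, K, c_in, c_out in layers)

# Hypothetical two-layer networks that differ only in feature-map size.
large_input = [(64, 3, 1, 8), (32, 3, 8, 16)]
small_input = [(16, 3, 1, 8), (8, 3, 8, 16)]

cost_large = conv_time_cost(large_input)     # 1474560
cost_small = conv_time_cost(small_input)     # 92160
ratio = cost_large / cost_small              # 16.0
```

Shrinking every feature map by a factor of 4 lowers the estimated cost by a factor of 16, since the feature-map size enters Eq. (6) quadratically.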

3.3 Architecture of proposed DAE-CNN

Due to the high-dimensional features of Android apps, training deep neural networks is expensive. To reduce the training time while taking advantage of the power of CNN, we combine DAE with CNN. Using DAE as a pre-training method captures the essential features of Android apps efficiently. We extract the output of the hidden layer in DAE and add sparse rules to make it usable as input for CNN. Considering the time complexity discussed in Sect. 3.2, DAE-CNN, shown in Fig. 4, can learn more flexible patterns of the training data in a short time.

Fig. 4 DAE-CNN-S model

4 Malware detection with CNN

Figure 5 illustrates the framework of the proposed DAE-CNN model. As described in the figure, we extract features from 23,000 apps collected from various app stores and process the data into a form suitable for deep learning models. We use Keras, the Python deep learning library, to implement the deep learning models. To demonstrate that DAE-CNN can improve the detection accuracy, we conduct experiments with DAE-CNN and with traditional machine learning methods. To analyze the effect of CNN models under different parameters, CNN models with different structures are applied to malware detection. We also compare DAE-CNN models with CNN models to verify that DAE-CNN can reduce the training time.

Fig. 5 Framework of DAE-CNN models

4.1 Data preparation

We crawl 10,000 apps from the Anzhi play store. All 10,000 apps are scanned by VirusTotal and confirmed as benign. We collect 13,000 malicious apps from VirusShare. With these 23,000 apps, we thoroughly train and test various models.

In this work, we are motivated to conduct experiments on high-dimensional features of large-scale Android apps. Compared with dynamic analysis, static analysis costs less in time and complexity. Therefore, we conduct static analysis to extract features from each app with Androguard and the Android SDK Tools. In this way, we obtain a total of 34,570 features for each app. The seven categories of static features are permissions, requested permissions, filtered intents, restricted API calls, hardware features, code related patterns, and suspicious API calls. This number of features is too large to be processed efficiently, so we reconstruct the structure of the features. In detail, we encode all the features, use the feature codes to represent each app, and pad with zeros where necessary. As a result, the dimension of the dataset is reduced from 34,570 to 413.

To make the data usable by the CNN model and to improve the accuracy, we use \({{\varvec{x}}_{\varvec{i}}} \in {R^{256}}\) to denote the 256-dimensional feature vector corresponding to the \(i\)-th feature in the feature codes of each Android app. Each app is therefore represented as a matrix of size 413 × 256. The dataset is randomly divided into training data and testing data at a ratio of 4:1.
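
One possible reading of the encoding, padding and splitting steps is sketched below in plain Python. The feature names, the helper functions, the truncation of over-long apps and the fixed random seed are all assumptions made for illustration; the mapping of each code to a 256-dimensional vector is left out. A toy length of 4 stands in for the 413 used in this paper.

```python
import random

PAD = 0  # code reserved for padding

def build_vocabulary(apps):
    """Assign each distinct feature string an integer code starting at 1."""
    vocab = {}
    for features in apps:
        for name in features:
            vocab.setdefault(name, len(vocab) + 1)
    return vocab

def encode_app(features, vocab, length):
    """Turn an app's feature list into a fixed-length list of codes."""
    codes = [vocab[name] for name in features][:length]
    return codes + [PAD] * (length - len(codes))

def split_dataset(samples, train_parts=4, test_parts=1, seed=0):
    """Shuffle and split samples at the given ratio (4:1 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical feature names standing in for the extracted static features.
apps = [
    ["android.permission.INTERNET", "android.permission.SEND_SMS"],
    ["android.permission.INTERNET"],
    ["android.hardware.camera", "android.permission.CAMERA",
     "android.permission.INTERNET"],
]
vocab = build_vocabulary(apps)
encoded = [encode_app(a, vocab, length=4) for a in apps]

# Splitting 23,000 sample indices at 4:1 gives 18,400 and 4,600 samples.
train, test = split_dataset(range(23000))
```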

4.2 Experimental setup

We conduct experiments with several variants of deep learning models and with other machine learning models. All the experiments are conducted in the same environment with the same dataset. Keras is a high-level neural networks API written in Python and capable of running on top of TensorFlow or Theano. The deep learning models are trained and tested by calling the Keras functional API.

  • Basic CNN (CNN-0) The baseline model consists of one convolutional layer, one max-pooling layer and one fully connected layer (shown in Fig. 6). The activation of the convolutional layer is Relu and that of the fully connected layer is softmax.

  • CNN-S The structure of the CNN-S model is illustrated in Fig. 2. CNN-S is trained from scratch. Kernel size and filter size are the main factors affecting the accuracy of the trained model. We design the CNN-S model according to the dimensions of the input and output data. The filter window of the first convolutional layer is set to 4 × 256 with 50 kernels. The pool size of the first max-pooling layer is 10 × 1. The filter window of the second convolutional layer is set to 6 × 1 with 50 kernels. The pool size of the second max-pooling layer is 6 × 1. The filter window of the third convolutional layer is set to 6 × 1. For each convolutional layer, we apply the nonlinear activation function Relu to achieve scale invariance. We adopt two fully connected layers to aggregate the features learned from the second pooling layer and the third convolutional layer for classification. The detailed parameters are shown in Table 1.

  • CNN-P CNN-P is pre-trained on CNN-multichannel (Kim 2014) and then fine-tuned. The final hyperparameters of the CNN-P model are set as follows: three filter windows of 3 × 256, 4 × 256 and 5 × 256 with 80 kernels each, the nonlinear activation function Relu, pool sizes in the max-pooling layers of 411 × 1, 410 × 1 and 409 × 1, a dropout rate of 0.5, and 50 iterations. With these hyperparameters, we obtain 240 feature maps, which provide the best features for classification. The detailed parameters are shown in Table 2.

  • DAE We evaluate four different DAE structures on the dataset: 413-200-100-2, 413-200-2, 413-100-20-2 and 413-200-100-20-2. For each structure, we apply Relu as the activation function on each layer. There are two key factors in DAE training: the structure and the number of iterations. Due to overfitting, a model does not always perform better as the number of iterations increases. We therefore try different numbers of iterations with 413-200-100-20-2; Table 3 shows how the performance of DAE changes with the number of iterations. Based on the results of the DAE model, the structure 413-200-100-20 is chosen for pre-training the CNN.

  • DAE-CNN As the number of layers in a deep learning model increases, the number of feature maps increases exponentially, and the training time grows rapidly as well. It is therefore necessary to build a layer that reduces the dimension of the input data. We use the output of the third layer of DAE (413-200-100-20-2) as the input of the first convolutional layer in CNN-S, and as the input of CNN-P as well, to compensate for CNN’s limitation and reduce the training time. We add the sparse rules (Glorot et al. 2011) to the 20 features of each app extracted from the DAE model so that CNN-S and CNN-P can be applied. The hyperparameters of DAE-CNN-S are set as follows: the filter windows of the convolutional layers are 3 × 256, 3 × 1 and 2 × 1, and the pool sizes of the pooling layers are 3 × 1 and 3 × 1. The parameters of DAE-CNN-P are set as follows: the pool sizes of the pooling layers are 18 × 1, 17 × 1 and 16 × 1.

  • Other traditional machine learning methods We compare the proposed methods with some of the machine learning methods mentioned in Sect. 2. We employ the scikit-learn packages (Pedregosa et al. 2011; Glorot et al. 2011), written in Python, as the machine learning tools in the experiments. The following methods are evaluated with the same training set and testing set: SVM, Decision Tree, Random Forest (RF) and K-Nearest Neighbor (KNN).
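
The window and pool sizes listed above can be checked with simple shape arithmetic. The sketch below assumes stride-1 convolutions without padding and non-overlapping pooling; these settings are not stated explicitly in the text, but they reproduce the reported pool sizes.

```python
def conv_out(length, window):
    """Output length of a stride-1 convolution without padding."""
    return length - window + 1

def pool_out(length, pool):
    """Output length of non-overlapping max pooling."""
    return length // pool

# CNN-S: conv 4 -> pool 10 -> conv 6 -> pool 6 -> conv 6 on 413 feature codes.
n = 413
c1 = conv_out(n, 4)        # 410
p1 = pool_out(c1, 10)      # 41
c2 = conv_out(p1, 6)       # 36
p2 = pool_out(c2, 6)       # 6
c3 = conv_out(p2, 6)       # 1

# CNN-P: three parallel windows, each pooled over its whole feature map.
cnn_p_pools = [conv_out(413, m) for m in (3, 4, 5)]   # [411, 410, 409]
cnn_p_features = 3 * 80                               # 240 feature maps

# The same three windows applied to the 20 features taken from DAE.
dae_pools = [conv_out(20, m) for m in (3, 4, 5)]      # [18, 17, 16]
```

Under these assumptions, pooling over each whole feature map in CNN-P leaves one value per kernel, which gives the 240 features, and the 18 × 1, 17 × 1 and 16 × 1 pool sizes follow from the 20-dimensional DAE output in the same way.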

Fig. 6 Traditional CNN model

Table 1 Parameters of CNN-S model
Table 2 Parameters of CNN-P model
Table 3 Detection results with DAE model

4.3 Results

The results of the ten experiments are reported in Table 4. The following metrics are used to evaluate the performance of the structures: FPR (false positive rate), TPR (true positive rate), ACC (accuracy), Recall, PPV (positive predictive value), FSCORE (the harmonic mean of precision and sensitivity: \(FSCORE=2TP/(2TP+FP+FN)\)) and Training Time.
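
For clarity, these metrics can be computed from the confusion-matrix counts as in the following sketch. The counts are made up for illustration and are not results from Table 4; the function name is hypothetical.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)                     # true positive rate (recall)
    fpr = fp / (fp + tn)                     # false positive rate
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    ppv = tp / (tp + fp)                     # positive predictive value
    fscore = 2 * tp / (2 * tp + fp + fn)     # harmonic mean of PPV and TPR
    return {"TPR": tpr, "FPR": fpr, "ACC": acc, "PPV": ppv, "FSCORE": fscore}

# Made-up counts for illustration only; they are not results from Table 4.
m = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

Here TPR and Recall denote the same quantity, and the FSCORE expression equals 2 · PPV · TPR / (PPV + TPR).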

Table 4 Detection results

4.4 Evaluations

  • Effect of CNN structures: All the experiments are conducted in the same environment. Figure 7 shows the ACC and FPR of the proposed algorithms. It can be seen from Fig. 7 that different CNN structures perform differently. CNN-S achieves better accuracy and F-score than the basic CNN and CNN-P. Compared with traditional machine learning methods, training with CNN noticeably improves the accuracy. Compared with SVM, the accuracy of the CNN-S model is improved by 5%. The CNN models also have disadvantages. Traditional machine learning methods only need to store a matrix for prediction, while CNN has to store the whole model. Training a CNN model involves thousands of times as many parameters as traditional machine learning. Thus, training traditional machine learning models takes little time and space compared with CNN models.

  • Effect of pre-training methods: The DAE-CNN models are proposed to reduce the training time. As shown in Fig. 8, models with the DAE pre-training method consume less time than CNN models without DAE. The training time of DAE-CNN-S is reduced by 83% compared with the CNN-S model. Although the accuracy of DAE-CNN is slightly lower than that of the CNN models, it is still higher than that of traditional machine learning methods. In general, taking time and the other evaluation metrics into consideration, DAE-CNN-S has more advantages in Android malware detection than the other methods considered in this paper. We believe that the combination of DAE and CNN has the potential for high accuracy when the dataset is large enough.

  • Other hyperparameters in CNN models: Relu (Nair and Hinton 2010) does not saturate in the positive region, so using it as the activation function avoids vanishing gradients there and increases the accuracy of the CNN. We also conduct experiments using sigmoid, and the results shown in Table 5 indicate that Relu improves the performance of CNN. Dropout is an efficient way to prevent overfitting.

Fig. 7 Comparison of ACC and F-Score in ten different models

Fig. 8 Comparison of Time and ACC between CNN models with and without DAE

Table 5 Detection results with DAE model

5 Conclusion

In this work, we propose a hybrid model for Android malware detection with DAE and CNN to improve the detection accuracy and reduce the training time. The CNN-S and CNN-P structures are employed in the training process, during which the “dropout” technique is used to prevent overfitting. Experimental results demonstrate that the proposed model is effective for large-scale Android malware detection. Compared with SVM, the accuracy of CNN-S model is improved by 5%. The time consumption for training with DAE-CNN is reduced by 83% compared with CNN-S model. In future work, we will extract more fine-grained features from Android apps and explore more effective algorithms to analyze and detect more sophisticated Android malware.