1 Introduction

Over the past few decades, Internet services have played an essential role in our daily lives. Consumers purchase goods from online shopping platforms, such as Amazon and eBay. Social media sites such as Facebook and Twitter allow friends to communicate and fans to follow their idols. Moreover, mobile apps and webpages often provide access to useful Internet services. Web designers frequently enhance webpages with embedded scripts and background processes. However, malicious functions in webpages can steal user information [7, 27].

Numerous types of malicious webpages currently exist, including drive-by downloads, clickjacking, phishing, and social engineering attacks [24]. In drive-by downloads, users unintentionally download malware [8]. In clickjacking, a user is tricked into clicking on something different from what they perceive; thus, confidential information can be revealed [14]. Phishing attacks use e-mail or interactive webpages to steal user credentials [2]. Social engineering attacks exploit social relationships to trick users into revealing valuable information [13]. Identifying the aforementioned attacks is difficult for many users when they access webpages through Internet services, and these attacks often cause considerable damage [34]. Thus, an effective method for detecting malicious webpages substantially helps people browse the web securely.

In this paper, a content-aware malicious webpage detection (CAMD) method is proposed to examine webpage syntax through webpage contextual visualization. CAMD can analyze not only HTML but also JavaScript and cascading style sheets (CSS). Natural language processing systems can utilize word2vec models [12] in their learning procedures, but programs that analyze webpage syntax confront much more complex problems and thus cannot simply use word2vec. To handle complicated and varied webpages, the proposed CAMD uses webpage contextual visualization as a preprocessing procedure, which retrieves the critical codes of webpages according to the Token-ASCII-Sum approach. The retrieved codes are transformed into one-dimensional grayscale images and then fed into a convolutional neural network (CNN) to distinguish between normal and malicious webpages. According to the experimental results, CAMD achieved 98.08% predictive accuracy; the area under the curve (AUC) value was 0.995 with a 2.598% false positive rate (FPR). Moreover, CAMD exhibited a true positive rate (TPR) exceeding 98%. The experimental results indicate that the proposed method outperforms previous approaches.

The rest of this paper is organized as follows. Section 2 reviews related studies. Section 3 describes the proposed CNN-based CAMD method. Section 4 presents the experiments performed to validate the proposed method. Section 5 presents the conclusion and opportunities for future studies.

2 Related studies

The detection of malicious webpages presents two challenges. First, the computational complexity of a detection method should be as low as possible to maintain the browser’s performance during detection. Second, attack techniques evolve daily; thus, malicious webpage detection should be flexible and frequently updated. The blacklist approach used by many antivirus companies is a common method for detecting malicious webpages [11, 31]. However, the blacklist approach is hampered by many limitations. For example, a new malicious webpage can easily evade a blacklist; thus, the blacklist approach cannot overcome the second challenge. To overcome this limitation, machine learning algorithms have been introduced for detecting malicious webpages [24, 26, 27]. Unlike the blacklist approach, machine learning algorithms can detect new malicious webpages. These methods may not achieve the highest accuracy, but they can effectively select suspicious webpages, which can then be passed to other high-precision methods for further testing. Therefore, several studies have proposed machine learning methods to prevent attacks from malicious webpages [1, 6, 10, 25, 26, 28, 29, 33, 37].

In [28], an automatic feature extraction method, in which character-level embedding was combined with a CNN, was proposed, and fragmented Uniform Resource Locator (URL) data were used as input to detect malicious webpages. However, the feature extraction procedure of this neural network was complicated. Canali et al. [6] proposed a webpage filter, Prophiler, which retrieved various features from HyperText Markup Language (HTML), JavaScript, and URLs to detect malicious webpages. Their results indicated that Prophiler performed efficiently with little computational complexity. However, detecting malicious webpages in a short time by using the aforementioned methods is difficult, and these methods often require substantial computing resources. Moreover, the format of the webpage features must be defined when a parser is used; therefore, such methods are not flexible in practical applications. Abdi et al. [1] detected malicious webpages by employing a multilayer architecture, in which the first layer was similar to the blacklist approach: a URL was compared with an existing blacklist, and a matched URL was directly judged as malicious. When a URL passed the first layer, a term frequency-inverse document frequency (TF-IDF) statistical method and a word2vec neural network were used to extract URL features. Subsequently, a CNN, a support vector machine, and linear regression were used to determine whether the URL was malicious. This method can be useful for detecting malicious URLs, but the blacklist must be maintained and updated frequently. Additionally, some URLs are exploited to redirect users to malicious webpages through forwarding mechanisms.

Malicious websites usually disguise themselves as normal websites to acquire users’ security information. To detect the attacks hidden in such websites, various deep learning models have proven efficient for malicious webpage detection [21, 35, 36]. In [36], a convolutional gated recurrent unit (GRU) neural network was proposed for detecting malicious URLs. The URLs are composed of characters, which are used to retrieve text features for classification. The feature representation method assumes that malicious keywords are distinctive in URLs; thus, these keywords are used to form a representation that serves as the input of the GRU model. The results show that the feature representation method helps the GRU model obtain high accuracy. In [35], an algorithm called URL embedding (UE) was proposed to learn coefficients of URLs. The method uses a distributed feature representation of URLs to avoid the curse of dimensionality. The UE method produces a low-dimensional vector that retains important feature information and helps the deep learning model obtain high accuracy for malicious attack detection. In [21], a malicious website detection method from the user’s perspective was proposed for detecting web spam. The method uses a convolutional neural network as the classification algorithm and also discusses various web spam techniques. In [5], MSCA-Net was introduced to integrate multi-scale contextual feature fusion and multi-context attentional features. The network captures contextual information to improve the accuracy of image segmentation, and the model improves conventional segmentation methods for decision-making. Moreover, many influential deep learning models, such as LeNet [19], AlexNet [18], GoogLeNet [32], and VGGNet [30], are used for classification and recognition in fields such as games [23], the Internet [9], and medical applications [3, 4]. These well-known models provide excellent performance.

In contrast, the proposed CAMD model examines webpage syntax, including HTML, CSS, and JavaScript code, through webpage contextual visualization. The Token-ASCII-Sum method extracts codes from the original webpage and transforms them into one-dimensional grayscale images that serve as the feature representation. Because the proposed model is a content-aware approach to malicious webpage detection, when the order of the codes in a malicious webpage is reorganized to resemble a harmless webpage, the reorganization does not affect the model, and the model can still classify the webpage as normal or malicious. Therefore, as shown in the experimental section, the model can be applied in real situations.

3 Content-aware malicious webpage detection using CNN

In the proposed method, a CNN is used to distinguish between normal and malicious webpages. The problem of malicious webpage detection can be considered a binary classification problem, in which the detection result is either a normal or a malicious webpage. Accordingly, the problem of malicious webpage detection is defined as follows.

A data set D contains various original webpages, D = {(t1,y1),(t2,y2),...,(t|D|,y|D|)}, where i = 1,2,3,...,|D|; |D| is the cardinality of the data set; and yi ∈ {0,1}. yi = 0 indicates that ti is a malicious webpage, and yi = 1 indicates that ti is a normal webpage.

Figure 1 reveals that the CAMD architecture includes three parts: (A) data preprocessing, (B) WcvNet model training, and (C) WcvNet model testing.

Fig. 1 CAMD architecture

3.1 Data preprocessing

Data preprocessing retrieves critical codes from the webpage under test and converts them into a visible vector, which can be further represented as a grayscale image. The preprocessing procedure, Token-ASCII-Sum, comprises two steps. The first step converts the critical code vector W of a webpage under test ti into a numerical vector X. In the second step, the numerical vector X is transformed into a visualized vector V to filter noise and increase the prediction accuracy.

In the first step, a regular expression is employed for word string comparison. Figure 2 reveals that the regular expression instruction [\w+] was applied to retrieve the webpage content, and the results are marked as blue blocks. By using the instruction [\w+], English uppercase and lowercase letters (a-z and A-Z), digits (0-9), and underscores (_) are retrieved from the original webpages, and the remainder, for example Chinese characters and symbols, is ignored. Figure 2 presents an example in which a total of 65 tokens (strings), that is, wi for i = 1,2,...,65, are marked. All 65 tokens wi form a token vector W, where W = {wi, ∀i}. Subsequently, each wi is transformed into a corresponding value xi, and all xi form the numerical vector X, that is, X = {xi, ∀i}. The Token-ASCII-Sum transformation is presented in (1).

Fig. 2 Example of word string comparison by using regular expression

$$ x_i = \Big( (|w_i|-1) \bmod N \Big) \times \Big\lceil \frac{65535}{N} \Big\rceil + \Big( \sum\limits_{CT=1}^{|w_i|} A_{CT} \Big) \bmod \Big\lceil \frac{65535}{N} \Big\rceil $$
(1)

where |wi| is the length of token wi; N is a predefined constant used to partition long tokens; CT is the ordinal number of a character in token wi; ACT denotes the ASCII code of the CT-th character; and 65536 (i.e., grayscale values 0-65535) is the numerical range of a 16-bit grayscale image. In the proposed method, when the Token-ASCII-Sum process uses fewer bits to represent grayscale images, some important features may be lost in the transformation from original webpages to grayscale images; in other words, the grayscale image cannot retain sufficient feature information. Conversely, the more bits used to represent the grayscale, the higher the classification accuracy; however, more hidden layers are then required in the proposed model, resulting in a performance penalty. In real situations, the end-to-end prediction of the model must not spend too much time on the recall process. Therefore, 16 bits (0-65535) are chosen for the Token-ASCII-Sum process.

Consider the string “holder” as an example, with parameters |wi| = 6, CT = 1,...,6, and N = 4. The ASCII codes of the characters are h = 104, o = 111, l = 108, d = 100, e = 101, and r = 114; consequently, the summation of ACT for this string is 638. According to (1), the xi of the token “holder” is 17022. When the token vector W = [This, is, a, pen, holder] is retrieved from a certain webpage, the corresponding numerical vector is X = [49560, 16604, 97, 33091, 17022], which is a five-dimensional vector whose elements xi are scalars. The N value affects the accuracy of the final CAMD decision. In the second step of data preprocessing, the numerical vector X is transformed into a visualized vector V.
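For illustration, the following minimal Python sketch implements the Token-ASCII-Sum transformation in (1) and reproduces the worked example above; the function names are ours, and the bracketed regular expression in the text is interpreted here as the pattern \w+.

import re
from math import ceil

def token_ascii_sum(token, n):
    """Map one token to a 16-bit value according to (1)."""
    bucket = ceil(65535 / n)                        # ceil(65535/N)
    length_part = ((len(token) - 1) % n) * bucket   # partitions long tokens
    ascii_sum = sum(ord(c) for c in token)          # sum of ASCII codes A_CT
    return length_part + (ascii_sum % bucket)

def webpage_to_numeric(source, n):
    """Extract tokens (interpreted as the pattern \w+) and transform each one."""
    return [token_ascii_sum(w, n) for w in re.findall(r"\w+", source)]

# Reproduces the example in the text with N = 4
print(webpage_to_numeric("This is a pen holder", 4))
# -> [49560, 16604, 97, 33091, 17022]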

Consider another token vector W = [a, aa, aaa, aaaa, aaa…a1, aaa…a2] as an example, where aaa…a1 contains 32 consecutive characters ’a’ and aaa…a2 contains 33 consecutive characters ’a’. When N is set to 32 and the process presented in the first step is followed, the corresponding numerical vector X is [97, 2242, 4387, 6532, 64544, 1153]. To obtain the visualized vector of webpage ti, each xi of X is mapped to a grayscale level. Figure 3 illustrates that the grayscale level ranges from 0 to 65535 (the last value is equal to 2^16 − 1). The mapping result forms a visualized vector V = {vi, ∀xi ∈ X and xi → vi}. Figure 3 presents the mapping result of W. When xi is small, the corresponding grayscale value is close to black; otherwise, it is close to white.

Fig. 3 Process of transferring the numerical vector to the visible vector

Figure 4 presents two examples of the transformation from a source webpage into a visualized vector. Figure 4(a) is an example of a malicious webpage, and Figure 4(b) is an example of a harmless webpage. The visualized vector in this illustration is a 1 × 17 vector represented as a grayscale image, which is also the input data of the proposed deep neural network. Each webpage under test is converted into a grayscale image to retain the spatiality of the original text of the webpage. However, the sizes of the token vectors retrieved from different webpages vary, which complicates the CNN training process. In this study, the vector size was unified to 1 × 14400 because the maximum number of tokens retrieved from the webpages by applying the regular expression instruction [\w+] was 14339. After the transformation of the token vector into a numerical vector, if the size of the numerical vector was smaller than 1 × 14400, the remaining elements were filled with zeros (zero-padding).
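A minimal sketch of this fixed-length representation, assuming the 16-bit value range and the 1 × 14400 length reported above (the array and function names are ours):

import numpy as np

FIXED_LEN = 14400   # the maximum token count observed in the dataset was 14339

def to_grayscale_vector(numeric, fixed_len=FIXED_LEN):
    """Zero-pad the numerical vector and store it as a 1-D 16-bit grayscale image."""
    image = np.zeros(fixed_len, dtype=np.uint16)    # grayscale levels 0-65535
    image[:len(numeric)] = numeric                  # zero-padding fills the rest
    return image.reshape(1, fixed_len)              # shape 1 x 14400

# Example: the six-token vector obtained with N = 32 in the text
vec = to_grayscale_vector([97, 2242, 4387, 6532, 64544, 1153])
print(vec.shape, vec[0, :6])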

Fig. 4 Example of transformation of a source file of a webpage into a visualized vector. (a) An example of a malicious webpage (b) An example of a harmless webpage

3.2 WcvNet model training

To classify malicious and normal webpages, a novel neural network, WcvNet, is proposed; it is a GoogLeNet-based CNN model. GoogLeNet, proposed by Szegedy et al. [32] in 2014, enhances the network width and uses convolution kernels of different sizes, such as 1 × 1, 3 × 3, and 5 × 5, to extract features. The 1 × 1 convolution operation reduces the dimension of the feature maps to avoid a large number of parameters after the network is deepened and widened.

All convolution and pooling layers in WcvNet are designed specifically for training on 1D grayscale images; the feature detectors used in the convolution layers range from 1 pixel to 5 pixels in size. With a 1-pixel stride, continuous syntax fragments in the web source files are extracted, which allows syntax fragments of different lengths to be interpreted. The 3-pixel pooling kernel used in the pooling layer concentrates features from the convolution feature map of the previous layer and retains the adjacency relations in the context of the web source file. In the inception layers, the pooling stride is set to 2 to greatly reduce the number of parameters and balance accuracy and efficiency. Figure 5 illustrates the WcvNet architecture, where conv. 1 × 5 + 1 denotes a feature detector of size 1 × 5 with a one-pixel stride in a convolution layer, and max-pooling 1 × 3 + 1 denotes a pooling kernel of size 1 × 3 with a one-pixel stride.
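The exact layer composition of WcvNet is given in Fig. 5; as a hedged illustration of the notation above, a 1-D inception-style block with 1 × 1, 1 × 3, and 1 × 5 convolutions (stride 1) and a 1 × 3 max-pooling branch could be sketched in Keras as follows (the layer widths are illustrative assumptions, not the published configuration):

from tensorflow.keras import layers

def inception_1d(x, filters):
    """Parallel 1x1, 1x3, and 1x5 convolutions plus a 1x3 max-pooling branch,
    all with a one-pixel stride, concatenated along the channel axis."""
    b1 = layers.Conv1D(filters, 1, strides=1, padding="same", activation="relu")(x)
    b3 = layers.Conv1D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv1D(filters, 3, strides=1, padding="same", activation="relu")(b3)
    b5 = layers.Conv1D(filters, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv1D(filters, 5, strides=1, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling1D(pool_size=3, strides=1, padding="same")(x)
    bp = layers.Conv1D(filters, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])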

Fig. 5 Architecture of the proposed WcvNet network

For the activation functions of WcvNet, ReLU is used in all hidden layers and a sigmoid is used in the output layer. Convolutional layers, activation functions, a fully connected layer, and three optimization elements, namely batch normalization, global max-pooling, and activity regularization, are employed in WcvNet.
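A hedged sketch of how these components could fit together in a WcvNet-like model is shown below; the layer widths, the regularization strength, and the overall depth are illustrative assumptions rather than the published configuration, and inception_1d is the block sketched above.

from tensorflow.keras import layers, regularizers, Model

def build_wcvnet_like(input_len=14400):
    inp = layers.Input(shape=(input_len, 1))                  # 1-D grayscale image
    x = layers.Conv1D(32, 5, strides=1, padding="same", activation="relu")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling1D(pool_size=3, strides=2, padding="same")(x)
    x = inception_1d(x, filters=32)                           # sketch above
    x = layers.BatchNormalization()(x)
    x = layers.GlobalMaxPooling1D()(x)                        # replaces large FC layers
    x = layers.Dense(64, activation="relu",
                     activity_regularizer=regularizers.l1(1e-4))(x)
    out = layers.Dense(1, activation="sigmoid")(x)            # malicious vs. normal
    return Model(inp, out)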

3.2.1 Batch Normalization (BN)

BN is used in the WcvNet network to process the data so that they remain in an effective range when transmitted from one layer to another. The advantage of BN is that it maintains the distribution of the original data; thus, it is a method for optimizing neural networks. BN is usually introduced after a convolution layer or a fully connected layer. The method divides the data into small batches for stochastic gradient descent. When each batch of data is forward propagated, each layer is normalized as illustrated in Fig. 6. Consider a mini-batch B = {z1,z2,...,z|B|}, where zj represents a value of a neuron and j = 1,2,...,|B|. During training, each mini-batch is used to calculate the mini-batch mean, μB, and mini-batch variance, \({\sigma _{B}^{2}}\), as follows.

$$ \mu_{B} = \frac{1}{|B|} \sum\limits_{j = 1}^{|B|}z_{j} $$
(2)
$$ {\sigma_{B}^{2}} = \frac{1}{|B|} \sum\limits_{j = 1}^{|B|}{z_{j}^{2}} - {\mu_{B}^{2}} $$
(3)

where \(\sqrt {{\sigma _{B}^{2}}}\) is the standard deviation, that is, the square root of the variance \({\sigma _{B}^{2}}\).

Fig. 6 Example of batch normalization from one layer to another layer

When the mini-batch mean and mini-batch variance were obtained, the value, zj, was used for normalization to form a value, \(\hat {z}_{j}\), as follows.

$$ \hat{z}_{j} = \frac{z_{j}-\mu_{B}}{\sqrt{{\sigma_{B}^{2}}+\epsilon}} $$
(4)

where 𝜖 is an arbitrarily small constant, which is added in the denominator to avoid the problem of dividing by zero.

The obtained normalized value, \(\hat {z}_{j}\), can be directly entered as an input into the activation function. The normalized value is further scaled and shifted as follows.

$$ v_{j} = \alpha \hat{z}_{j} + \gamma $$
(5)

where α and γ are the scale and shift parameters, respectively. The neural network uses them to scale and shift the normalized value, \(\hat {z}_{j}\). By using these two parameters, the WcvNet network gradually learns whether the normalization operation aids optimization; when the normalization does not work well, the two parameters can cancel out the operation [15]. When the network experiences slow convergence, vanishing gradients, or exploding gradients, the BN method can be introduced to resolve these problems.
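A small NumPy sketch of the batch-normalization computation in (2)-(5); the α, γ, and ε values below are illustrative:

import numpy as np

def batch_norm(z, alpha=1.0, gamma=0.0, eps=1e-5):
    """Normalize a mini-batch of neuron values according to (2)-(5)."""
    mu = z.mean()                           # (2) mini-batch mean
    var = (z ** 2).mean() - mu ** 2         # (3) mini-batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)   # (4) normalization
    return alpha * z_hat + gamma            # (5) scale and shift

print(batch_norm(np.array([0.2, 1.5, -0.7, 0.9])))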

3.2.2 Global Max-Pooling (GMP)

In place of the fully connected layers of a conventional CNN, the GMP method [20] is employed to take the maximum value of the feature map generated by the previous convolution layer, thereby reducing the dimension and preventing overfitting. When the GMP method is used, the number of neurons required by a fully connected layer does not need to be considered.
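As a brief illustration, global max-pooling keeps only the largest response of each channel of the feature map (a NumPy sketch with an illustrative 3 × 2 feature map):

import numpy as np

# Feature map of shape (length, channels) from the previous convolution layer
feature_map = np.array([[0.1, 2.0],
                        [0.7, 0.3],
                        [1.5, 0.9]])

gmp = feature_map.max(axis=0)   # one value per channel
print(gmp)                      # -> [1.5 2. ]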

3.2.3 L1 activity regularization

WcvNet is trained to decrease the error given by the loss function L, which is defined as follows.

$$ L = (d - v)^{2} $$
(6)

where d and v represent the desired output and output of the network, respectively.

Generally, the loss function is used to evaluate the difference between the desired output and the output of the neural network. L1 regularization adds the summation of the absolute values of the weights as a penalty term to the loss function. With L1 regularization, the loss function becomes smoother and the influence of noise is reduced. The loss function with the penalty term is given as follows.

$$ L = (d - v)^{2} + \lambda \sum |Weights| $$
(7)

where λ is a constant used to control the regularization strength. The penalty term enables the neural network to learn sparse features and prevents overfitting. WcvNet employs L1 activity regularization in the fully connected layer.
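For reference, a short NumPy sketch of the regularized loss in (7); the λ value and weights below are illustrative:

import numpy as np

def l1_regularized_loss(d, v, weights, lam=1e-4):
    """Squared error plus an L1 penalty on the weights, as in (7)."""
    return (d - v) ** 2 + lam * np.abs(weights).sum()

print(l1_regularized_loss(d=1.0, v=0.8, weights=np.array([0.5, -0.2, 0.0, 1.3])))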

After the sigmoid function, WcvNet produces a single output value in the interval (0, 1). An output value larger than a given threshold indicates a malicious webpage, whereas an output value smaller than the threshold indicates a normal webpage.

3.3 WcvNet model testing

To detect malicious webpages, the proposed WcvNet model was trained to minimize the loss function and increase the prediction accuracy. During model testing, a testing data set that was not used for model training was employed to evaluate the loss value of the WcvNet model.

To obtain a superior classification result, an adaptive learning rate tuning method was employed to reduce the loss value within a few epochs. This method sets an initial learning rate for updating the weights of all neurons, and then, in each epoch, the loss value is calculated and recorded. For loss value optimization, adaptive moment estimation (Adam) [16] was used. Adam, a self-adaptive learning rate algorithm, keeps the learning rate of each epoch within a stable parameter range after the correction of the training parameters. Therefore, Adam rapidly decreases the loss value and is often used to optimize CNNs.

4 Experiment

Training and verification data sets used in this study were obtained from the network threat intelligence website VirusTotal [17, 33]. VirusTotal receives numerous HTML files every day and judges malicious content by using 60 network threat scanning mechanisms offered by 12 information security businesses. As a preventive measure, the acquired data sets were scanned with antivirus software on personal computers to determine whether the source data were normal or malicious before the experiment. Table 1 presents the quantity and distribution proportion of the data sets. A total of 100,000 webpages were adopted, including 50,000 malicious webpages and 50,000 normal webpages.

Table 1 Dataset structure of this study

4.1 Comparison between the different settings of parameters

In model training, a large amount of training data cannot be loaded simultaneously because of limited hardware resources. Generally, the entire data set is input to the model in batches for training. The batch size used for the training model was 20, the number of iterations was 1000, and the number of epochs was 40. Under the lab environment and the highest utilization of computing resources, this configuration provided the most favorable balance for training effectiveness.

To obtain an accurate classification result from WcvNet, reducing the loss value to its lowest point during training is essential. Figure 7 presents the learning curve of the loss value, where the x-axis and y-axis represent the epoch and the loss value, respectively. In the 1st epoch, the loss is approximately 0.371 and the accuracy reaches approximately 86%. In the 38th epoch, the lowest loss value of 0.126 is reached and the accuracy reaches 98.1%. From the 22nd to the 37th epoch, the accuracy is approximately 97%. The detailed loss and accuracy values for different epochs are recorded in Figs. 7 and 8.

Fig. 7 Learning curve of the loss value

Fig. 8 Accuracy of WcvNet

When the loss value of the network did not decrease for two epochs, the learning rate was automatically reduced by half so that the loss value converged properly. After model training and verification with several parameter combinations, the best initial learning rate was 0.0001, the loss-reduction tolerance was 2 epochs, and the loss-reduction ratio was 50%.
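A hedged Keras sketch of this training configuration (Adam with an initial learning rate of 0.0001, halving the learning rate after 2 epochs without improvement, batch size 20, 40 epochs); the model builder refers to the WcvNet-like sketch in Section 3.2, and the training arrays below are dummy placeholders rather than the VirusTotal data:

import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Dummy data shaped like the padded grayscale vectors and their 0/1 labels
x_train = np.random.randint(0, 65536, size=(200, 14400, 1)).astype("float32")
y_train = np.random.randint(0, 2, size=(200, 1))

model = build_wcvnet_like()                          # sketch from Section 3.2
model.compile(optimizer=Adam(learning_rate=1e-4),    # initial learning rate 0.0001
              loss="mse", metrics=["accuracy"])      # squared-error loss as in (6)

# Halve the learning rate when the loss has not improved for two epochs
reduce_lr = ReduceLROnPlateau(monitor="loss", factor=0.5, patience=2)

model.fit(x_train, y_train, batch_size=20, epochs=40, callbacks=[reduce_lr])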

The N value influences the accuracy of CAMD’s final decision. To investigate the effect of parameter N, separated tokens with different N were compared, and the results are presented in Fig. 9, where the x-axis and y-axis represent the epoch and prediction accuracy, respectively. Figure 9 indicates that when N = 32, the proposed CAMD obtained the highest accuracy; thus, parameter N was set to 32 in the subsequent experiments.

Fig. 9 Comparison of separated tokens with different N

4.2 Comparison of the CAMD architecture with previous methods

Three established models, LeNet [19], AlexNet [18], and the hierarchical inspector approach (HIA) [29], were used to evaluate the accuracy of malicious webpage detection and were compared with the proposed CAMD. The same data preprocessing phase was used for LeNet and AlexNet, and the same grayscale images were input to these models. The HIA model uses a regular expression to process source webpages into a token stream, which is divided into 16 equal parts; it then applies the hashing trick to each part and uses the resulting hierarchical features as input data for feature extraction and classification. Table 2 presents the total numbers of parameters used in CAMD, LeNet, and AlexNet; the proposed model has the lowest number of parameters.

Table 2 Numbers of parameters used for three models

In model training, the labeled training data were input to train the network and adjust the neuron weights. The validation data were used to test the classifier. CAMD is a binary classifier used to determine whether a webpage is malicious or normal. The threshold for the binary classification of malicious and normal webpages was 0.5; a value > 0.5 was considered positive (malicious webpage), and a value < 0.5 was considered negative (normal webpage). True positives (TP), false negatives (FN), true negatives (TN), false positives (FP), and the commonly used extended evaluation metrics of the confusion matrix for binary classification were used to evaluate model reliability. A total of 10,000 verification samples were input to CAMD, HIA, LeNet, and AlexNet. Table 3 presents the confusion matrices.

Table 3 Confusion matrices of four methods: (a) CAMD, (b) HIA, (c) LeNet, and (d) AlexNet

After the confusion matrices were generated, the indicators, including TP, FN, TN, FP, accuracy, precision, recall, F1-score, true positive rate (TPR), and false positive rate (FPR), were calculated to evaluate the accuracy of the deep learning models. The relevant evaluation indicators are given as follows.

$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$
(8)
$$ Recall = \frac{TP}{TP+FN} $$
(9)
$$ Precision = \frac{TP}{TP+FP} $$
(10)
$$ F_{\upbeta\text{-}score} = (1+{\upbeta}^{2}) \times \frac{Precision \times Recall}{{\upbeta}^{2} \times Precision + Recall}, \qquad F_{1\text{-}score} = 2\times\frac{Precision \times Recall}{Precision+Recall} \ \text{ when } \upbeta = 1 $$
(11)
$$ FPR = \frac{FP}{TN+FP} $$
(12)

where TPR is equal to Recall.
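For reference, the indicators in (8)-(12) can be computed directly from the confusion-matrix counts; the counts in the example call are illustrative, not the values in Table 3:

def evaluation_metrics(tp, fn, tn, fp):
    """Compute the indicators in (8)-(12) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)                  # equal to the TPR
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (tn + fp)
    return accuracy, precision, recall, f1, fpr

print(evaluation_metrics(tp=4900, fn=100, tn=4870, fp=130))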

Figure 10 presents the model evaluation results obtained using these equations, where the x-axis and y-axis represent the indicators and the validation index percentages, respectively. Figure 10 illustrates that the proposed CAMD provided the most accurate results in terms of accuracy, precision, and F1-score. In addition, the TPR and FPR validation indices were calculated, as presented in Fig. 11, where the x-axis and y-axis represent FPR and TPR, respectively. At an FPR of 2.5%, CAMD achieved a TPR of 98%. The proposed CAMD thus achieved high predictive accuracy.

Fig. 10 Validation indices for each method

Fig. 11 TPR and FPR indices for each method

The evaluation data were measured at a classification threshold of 0.5. To evaluate the performance under different classification thresholds, a receiver operating characteristic (ROC) curve was employed as the measurement benchmark. The closer the ROC curve is to the upper left corner, the greater the benefit (TPR) obtained at a fixed cost (FPR), that is, the higher the classification ability of the model. The ROC curves presented in Fig. 12 show the TPR performance of the four methods under distinct FPR values. The ROC curve of CAMD is closest to the upper left corner and thus exhibits the highest webpage classification ability, with an AUC of 0.995; the larger the AUC value of a model, the better its classification results.
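As a reference, the ROC curve and AUC can be computed from the sigmoid outputs with scikit-learn; the labels and scores below are illustrative placeholders, with malicious webpages treated as the positive class:

from sklearn.metrics import roc_curve, auc

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                            # 1 = malicious (positive)
y_score = [0.91, 0.12, 0.78, 0.45, 0.33, 0.08, 0.66, 0.51]   # WcvNet sigmoid outputs

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))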

Fig. 12 ROC curves

Table 4 presents the AUC values of the four methods. The AUC value of CAMD was 0.995, which is the highest among the four methods.

Table 4 AUC values of the four methods

4.3 Real example of model prediction in the CAMD and the HIA

Model prediction was performed on new data that were not included in the training and validation datasets. Most previous studies [22, 24, 27] used statistical methods, that is, calculating the appearance frequency of special words, to achieve high detection accuracy. However, such methods easily misjudge malicious and normal webpages. The proposed model instead uses the relations among contextual codes to detect both malicious and normal webpages. A webpage under testing (Fig. 13) was used to examine the prediction accuracy of the proposed CAMD and the HIA model. Figure 13(a) is a malicious webpage that includes a fragment of malicious syntax, and both the proposed model and the HIA [29] can detect this malicious webpage. However, when the order of the codes in the malicious webpage is reorganized, or additional syntax is embedded between the tags <script language = “javascript”> and </script>, to form a harmless webpage as shown in Fig. 13(b), the other methods cannot judge the harmless webpage correctly, whereas the proposed method can. In our experiments, the detection accuracy reached 98%; in other words, 2% of source webpages can evade detection. To increase the accuracy, the Token-ASCII-Sum process must use more bits to represent the grayscale images so that more features of the source webpages can be captured. However, in real situations, more bits for grayscale image representation means that much more time is needed for the recall process of the prediction model. Consequently, the 16-bit (0-65535) grayscale image is chosen for the Token-ASCII-Sum process in this paper. Table 5 presents the results. The threshold of CAMD was set to 0.5 to classify malicious and normal webpages; when the predicted value was ≥ 0.5, the webpage was considered malicious. When the webpage was reorganized and the malicious part was removed, CAMD was able to distinguish it. This result indicates that the proposed CAMD achieves high predictive accuracy and reduces the FPR.

Fig. 13 Webpage under testing. (a) Malicious webpage (b) Content of the malicious webpage reorganized into a harmless webpage

Table 5 Predictive results using neural networks

5 Conclusion

To effectively detect changeable malicious webpages, Token-ASCII-Sum data preprocessing for extracting webpage source file features is presented in this paper. By transforming a web source file into grayscale images and building on GoogLeNet, which provides excellent image recognition, a novel CNN model, WcvNet, is proposed specifically for classifying malicious and normal webpages. With fine adjustments and optimization methods, the proposed method achieves 98.08% prediction accuracy, with an AUC of up to 0.995. At a 2.5% FPR, it exceeds 98% TPR. Compared with methods employing word frequency statistics and other methods with excellent classification ability, the proposed method avoids false outputs after web content recombination during model prediction.

The proposed malicious webpage detection method is currently restricted by the dataset source and to training binary classification models. Future research could collect various types of malicious webpage datasets, train models to classify distinct malicious webpage types, and provide users with proper warnings about attack types to prevent unnecessary infringement and loss.