
1 Introduction

In recent years, the expansion of the Internet, modern technology and the ease of communication have paved the way for fraudsters and criminals to conduct fraudulent activities, resulting in the loss of billions of dollars worldwide each year [1]. Websites are a convenient means of accessing information, services and products; however, like any other online service, they can be exploited by fraudsters to prey on victims and propagate fraud. Fraudulent websites usually pose as legitimate online sources of information, goods, products and services [2]. Fraudulent investment websites, such as foreign currency exchange (Forex), gold and other precious metal investments, Ponzi and pyramid schemes and Multi-Level Marketing (MLM), together with fraudulent online shopping and e-commerce websites, are the most common types of fraudulent websites [3, 4]. Owing to the various undesirable impacts of fraudulent websites, several studies and approaches have emerged to detect them. Despite these efforts, the capabilities of existing approaches are limited and they are unable to keep up with the growth and diversity of fraudulent websites; as a result, the existing measures are fairly poor at detecting fraudulent websites [5].

The major challenges in fraudulent website detection are: first, the new generation of web technologies has increased the complexity of web scraping and limits detection models’ access to web content; second, the diversity of fraudulent activities on the web (e-commerce fraud, MLM, Forex, etc.) makes it difficult to prescribe a single global detection solution; third, the rapid growth of fraudulent websites renders static countermeasures obsolete and demands a dynamic solution; and lastly, fraudsters’ efforts to disguise, mislead, block and bypass detection models undermine their effectiveness.

An effective fraudulent website detection model should address these challenges and deliver accurate results within a reasonable time frame. This study proposes a fraudulent website detection model based on sentiment analysis of the textual content of a given website and supervised machine learning techniques. The proposed model was deployed on the Hadoop Big Data platform to ensure that it can keep up with the vast number and rapid growth of fraudulent websites. It also employs content-based machine learning techniques, which increase its robustness against new types of fraudulent websites. Furthermore, a tailored crawling technique ensures the availability of the required textual content regardless of the web technology and the type of fraudulent website. The next section briefly describes the state-of-the-art fraudulent website detection models and their capabilities and limitations in more detail.

2 Related Work

Content-based fraudulent website detection methods rely on website contents, components and metadata, such as domain registration information, body text style features, HTML features, URL and anchor text features, image features and links, to detect fraudulent websites. Various studies have utilized these components to determine the legitimacy or fraudulence of a given website.

Le et al. [6] used lexical features of URLs, such as URL tokens, lexical and syntactic measures, and domain registration information to identify fraudulent websites. They claimed that these features are immune to the obfuscation techniques used by fraudsters. Different classification techniques, including batch learning with Support Vector Machines as well as online learning algorithms such as Online Perceptron, Confidence-Weighted learning and Adaptive Regularization of Weights, were used in their study. Their evaluation results showed that Adaptive Regularization of Weights, with an accuracy of 96%, outperformed the other classification techniques.

In another study, Abbasi et al. [7] proposed a new fraudulent website detection system based on statistical learning theory (SLT). A combination of textual, URL, source code, image and linkage features was used to identify fraudulent websites. Their evaluation showed an accuracy of 96% using SLT on a dataset of 900 fraudulent websites. Another study by Abbasi et al. [7] attempted to detect fake escrow websites using a relatively similar feature vector; their technique achieved an accuracy of 98% using a kernel SVM classifier.

Martinez and Araujo [8] proposed a scam website detection system based on language model analysis. They applied a language model approach to different sources of information extracted from a given website in order to derive features for scam website detection. Their system relies on the hypothesis that two pages linked by a hyperlink should be topically related, even if this is only a weak contextual relation. Three sources of features from the source page (anchor text, surrounding anchor text and URL terms) as well as three sources of information from the target page (title, page content and meta tags) were used to construct the feature vector. Their model achieved an F-score of 81% in detecting scam websites.

A study by Urvoy et al. [9] proposed a spam website detection model based on style similarity measures of textual features in HTML source code. The Jaccard similarity index was used to measure similarity, and they also proposed a method to cluster a large collection of documents according to this measure. Their technique is particularly useful for detecting pages across different sites that share the same design. In a similar fashion, several other studies, including [11,12,13,14], attempted to identify spam websites using supervised machine learning techniques.

3 Methodology

Among the various components of a website mentioned earlier, textual content is the primary and presumably the largest element of a webpage. Textual content often explicitly conveys information about the semantics of a webpage and can therefore be employed as the primary source of discriminative features for fraudulent website detection. Various types of discriminative features, such as lexical, syntactic, structural, content-specific and idiosyncratic features, can be extracted from the textual content of a webpage. The proposed fraudulent website detection model employs Natural Language Processing (NLP) techniques, which allow us to go beyond rudimentary statistical features and extract semantic features of the text through natural language features [15, 16]. This study incorporates NLP techniques, textual features and machine learning techniques to construct a fraudulent website detection model. The proposed model consists of four primary phases: data acquisition, preprocessing, feature extraction and classification. The following sections describe each phase in detail. Figure 1 shows the framework of the proposed fraudulent website detection model.

Fig. 1. The proposed content-based fraudulent website detection model.

3.1 Data Acquisition Phase

This phase aims to extract raw textual information and metadata from a given website. Web crawling and web scraping are the two major operations used in this phase.

Despite the seemingly simple nature of web crawling and scraping, many websites employ restrictive measures such as browser ID checks, CAPTCHAs, IP monitoring and query limits to block unknown crawlers and reduce unwanted load on their servers. Most modern crawlers and scrapers therefore use sophisticated techniques to bypass the restrictions imposed by website administrators. In addition, the diversity of web technologies, programming languages and styles makes web crawling more difficult than ever before.

To tackle these issues, this study used Pyspider, which provides a highly customizable application, framework and libraries for crawling target websites. Pyspider is an application framework for crawling websites and extracting structured and unstructured data, which can be used for data mining, information processing or historical archiving. Web scraping software automatically loads and extracts data from multiple pages of a website based on given criteria and parameters; it is either custom built for a specific website or configurable to work with any website. To extract textual data from a given webpage, scrapers typically parse HTML tags and class objects such as <title>…</title>, <p>…</p> and <div>…</div>, which usually contain the textual content. Our scraper only collects textual data from the home page and its first-layer links; we believe this setup is adequate to collect enough textual data to identify a website’s sentiment. Once the textual content of a webpage is scraped, it is stored in a dataset for subsequent mining and analysis. In this study, a total of 430 websites were crawled and scraped, of which 257 are fraudulent and the rest are legitimate. The textual content of these websites was then dumped into text files to facilitate data cleaning and wrangling.
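As an illustration of this setup, a Pyspider handler along the lines described above might look like the following sketch. The seed URL, CSS selectors and scheduling parameters are assumptions for illustration, not the exact crawler configuration used in this study.

```python
from pyspider.libs.base_handler import *


class TextScraper(BaseHandler):
    """Collect visible text from a seed page and its first-layer links only."""

    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Hypothetical seed; in this study the crawl list is the 430 labelled sites.
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Queue first-layer links only; deeper pages are not followed.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)
        # The home page itself is also scraped for text.
        return self.detail_page(response)

    def detail_page(self, response):
        # Keep the tags that usually carry the visible textual content.
        return {
            'url': response.url,
            'title': response.doc('title').text(),
            'text': response.doc('p, div').text(),
        }
```

Each returned record can then be dumped to a per-site text file for the cleaning and wrangling steps described next.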

3.2 Preprocessing Phase

Regardless of the web scraper’s precision, a significant amount of unwanted noise, such as HTML or style tags, is always incorporated into the actual textual data. These noise patterns negatively affect detection performance and should be removed prior to the feature extraction and classification phases. This study employs several preprocessing techniques to remove unwanted noise and refine the textual data. Each preprocessing technique is described in more detail below:

  1. Tag Removal: Although most source code tags are removed from the textual data during the scraping phase, in some cases tags can still slip into the textual data. These tags were removed prior to any feature extraction.

  2. Stop Words Removal: Stop-words are frequently occurring words in a language that carry little meaning and do not reflect the semantics of a document. Articles, prepositions, conjunctions and some pronouns are common stop-words in English. Stop-words were removed prior to any feature extraction operations.

  3. Punctuation Removal: Similar to stop-words, punctuation marks occur frequently in a language but do not reflect the semantics of a document. Punctuation was removed prior to any feature extraction operations.

  4. Capital Letters Removal: Since capital letters do not affect the semantics of a word, all capital letters throughout the document were transformed to lowercase. This increases the uniformity of the document and reduces redundancy.

  5. Stemming: Stemming is the process of reducing words to their stems or roots. For example, “computer”, “computing” and “compute” are reduced to “comput”, while “walks”, “walking” and “walker” are reduced to “walk”. Martin Porter’s stemming algorithm was used in this study. Stemming increases accuracy and reduces the size of the text block by collapsing words to their stems.

  6. Tokenization: Tokenization can be performed at different scales, such as words, phrases and sentences. Since this research uses the Bag-of-Words (BOW) technique in the feature extraction phase, tokenization was performed at the word level.

After these preprocessing operations, the raw input documents are transformed into refined vectors of words that are used in the subsequent feature extraction phase.
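As a concrete illustration of steps 1–6, the following sketch implements the pipeline with NLTK and Porter’s stemmer; the regular expression for tag removal and the NLTK English stop-word list are assumptions, since the paper does not name the exact libraries used.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()


def preprocess(raw_text):
    """Turn raw scraped text into a refined list of word stems."""
    # 1. Tag removal: strip any residual HTML/style tags.
    text = re.sub(r'<[^>]+>', ' ', raw_text)
    # 4. Capital letters removal.
    text = text.lower()
    # 3. Punctuation removal.
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 6. Tokenization at the word level.
    tokens = word_tokenize(text)
    # 2. Stop words removal, then 5. Porter stemming.
    return [STEMMER.stem(t) for t in tokens
            if t.isalpha() and t not in STOP_WORDS]
```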

3.3 Feature Extraction Phase

Feature extraction is the most important stage in fraudulent website detection using natural language processing. Various types of discriminative features can be extracted from the webpage body text. This research employs the Bag-of-Words technique and Part-of-Speech tags to construct the feature vector and segregate fraudulent websites. The Bag-of-Words model is a simple yet effective technique in NLP: it builds a unigram model of the text by keeping track of the number of occurrences of each word in a given document, creating a corpus with word counts for each data instance (document). Depending on the average size of each document, word counts can be absolute, binary (contains or does not contain) or sublinear (the logarithm of the term frequency). For the purpose of this study, absolute word counts generated superior results compared with the other counting schemes. Figure 2 shows the Bag-of-Words pseudocode.

Fig. 2. Bag-of-Words pseudocode.
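Since Fig. 2 presents the pseudocode only as a figure, the sketch below shows an equivalent absolute-count Bag-of-Words representation built with scikit-learn’s CountVectorizer; the example documents are hypothetical. Setting binary=True would give the binary variant, and TfidfVectorizer(sublinear_tf=True) the sublinear one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical preprocessed documents (already cleaned and stemmed).
docs = [
    'invest profit hour payment safe fast',
    'contact us product ship deliveri',
]

# Absolute word counts, as used in this study.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary terms
print(bow.toarray())                       # per-document term counts
```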

Part-of-Speech tagging is the process of marking up a word in a given body of text as corresponding to a particular part of speech (e.g., noun, verb, adjective or adverb), based on both its definition and its context. Several pre-trained part-of-speech taggers are publicly available. One major reason for using Part-of-Speech tagging is to extract named entities and employ them in the feature extraction process. Compared with other parts of speech, named entities provide significantly higher discriminative power, which is beneficial for feature extraction. The Brill rule-based POS tagging technique was employed in this study [17].
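For illustration, the sketch below tags a sentence and extracts named-entity chunks with NLTK; note that NLTK’s default pos_tag uses a perceptron tagger rather than the Brill tagger cited above, so this is a stand-in for the actual tagger, and the example sentence is invented.

```python
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

tokens = nltk.word_tokenize('Invest with Acme Capital and earn profit within 24 hours.')

# POS tagging (perceptron tagger as a stand-in for the Brill tagger).
tagged = nltk.pos_tag(tokens)

# Named-entity chunking on top of the POS tags.
tree = nltk.ne_chunk(tagged)
entities = [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() in ('PERSON', 'ORGANIZATION', 'GPE')]

print(tagged)    # (word, POS tag) pairs
print(entities)  # named entities usable as features
```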

3.4 Classification Phase

This study employs an ensemble of classifiers with a voting scheme to increase the reliability of the predictions. A combination of six classifiers, namely Naïve Bayes, Multinomial Naïve Bayes, Bernoulli Naïve Bayes, Logistic Regression, a Stochastic Gradient Descent classifier and a Support Vector Machine, is used to construct the ensemble classifier and voting scheme. Naïve Bayes is a probabilistic classifier based on Bayes’ theorem with an assumption of conditional independence among predictors. In the training stage it estimates the parameters of a probability distribution, while in the testing stage it computes the posterior probability of a test sample belonging to each class. Despite its simplicity, Naïve Bayes performs reasonably well for text classification. Multinomial Naïve Bayes is a variant of Naïve Bayes suited to classification with discrete features (e.g., word counts for text classification); the multinomial distribution normally requires integer feature counts. Bernoulli Naïve Bayes is a Naïve Bayes classifier for multivariate Bernoulli models. Like Multinomial Naïve Bayes, it is suitable for discrete data; the difference is that while Multinomial Naïve Bayes works with occurrence counts, Bernoulli Naïve Bayes is designed for binary/boolean features. The Support Vector Machine is a two-class classifier that finds the hyperplane separating the two classes with the maximum margin between them; it generates satisfactory results on small to medium scale datasets. The Stochastic Gradient Descent (SGD) estimator implements regularized linear models with SGD learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (learning rate). Logistic Regression predicts a binary outcome (binomial Logistic Regression) or a multiclass outcome (multinomial Logistic Regression). The voting scheme is built on top of these classifiers; for example, if four out of six classifiers vote that a website is fraudulent, the voting scheme classifies it as fraudulent. 10-fold stratified cross validation is employed to validate the performance of the proposed model.
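A minimal scikit-learn sketch of this ensemble with hard (majority) voting and 10-fold stratified cross validation is given below. The plain Naïve Bayes classifier is represented here by GaussianNB, hyperparameters are library defaults, and the data are toy stand-ins sized like the study’s corpus rather than the real Bag-of-Words matrix.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.svm import SVC

# Toy stand-ins sized like the dataset: 257 fraudulent, 173 legitimate sites.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(430, 200)).astype(float)  # Bag-of-Words counts
y = np.array([1] * 257 + [0] * 173)                     # 1 = fraudulent

ensemble = VotingClassifier(
    estimators=[
        ('gnb', GaussianNB()),            # plain Naive Bayes (Gaussian variant)
        ('mnb', MultinomialNB()),
        ('bnb', BernoulliNB()),
        ('lr', LogisticRegression(max_iter=1000)),
        ('sgd', SGDClassifier()),
        ('svm', SVC()),
    ],
    voting='hard',  # majority vote, e.g. 4 of 6 classifiers decide
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(ensemble, X, y, cv=cv, scoring='accuracy')
print('10-fold CV accuracy:', scores.mean())
```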

4 Results and Analysis

A total of 430 websites were crawled and scraped, of which 257 are fraudulent and the rest are legitimate. 10-fold stratified cross validation was used to validate the results. Among the six classifiers used in this study, Logistic Regression and Multinomial Naïve Bayes outperformed the others in terms of accuracy, F-score and FPR. Logistic Regression, with respective accuracy, F-score and FPR of 98.83%, 98.32% and 2.55% under 10-fold cross validation, produces the best results. Multinomial Naïve Bayes, with respective accuracy, F-score and FPR of 99.41%, 99.11% and 1.64%, marginally underperforms Logistic Regression and is among the best classifiers in this research. The SVM classifier, with respective accuracy, F-score and FPR of 88.37%, 87.38% and 14.01%, is the worst performer in this study. Although Logistic Regression and Multinomial Naïve Bayes produce superior results compared with the other classifiers, their performance depends heavily on the quality and type of the input data and may fluctuate significantly as the input data changes. A voted classifier can mitigate this issue, improve the robustness of the proposed model and increase the consistency of the results. The proposed voted classifier, with respective accuracy, F-score and FPR of 97.67%, 97.25% and 3.49%, does not yield cutting-edge results but ensures consistent performance regardless of fluctuations in the input data. Table 1 also shows nearly perfect results in the training stage, which indicates the high generalizability and discriminative power of the proposed feature vector. Table 1 reports the evaluation metrics of the proposed fraudulent website detection model.

Table 1. Performance of the proposed model on the training set and under cross validation.

Figure 3 shows the 20 discriminative words that appear in fraudulent websites with relatively high frequency. Words such as “user”, “fast”, “safe”, “payment” and “hour” are among the most discriminative terms for fraudulent website detection. These words usually appear in phrases aimed at luring victims.

Fig. 3. Most discriminative words.
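The paper does not state how the ranking in Fig. 3 was obtained; one plausible way to produce such a list, sketched here purely as an assumption, is to score Bag-of-Words terms against the fraud label with a chi-squared test and keep the top 20.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Toy stand-ins; in practice these are the 430 preprocessed documents and labels.
docs = ['fast safe payment within hour', 'contact our support for shipping']
labels = [1, 0]  # 1 = fraudulent, 0 = legitimate

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
scores, _ = chi2(X, labels)

terms = vectorizer.get_feature_names_out()
top = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)[:20]
print(top)  # candidate discriminative terms with their chi-squared scores
```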

Figure 4 shows the ROC (Receiver Operating Characteristic) curves of the proposed fraudulent website detection model across the different classifiers, including Naïve Bayes, Multinomial Naïve Bayes, Bernoulli Naïve Bayes, SVM, Stochastic Gradient Descent and Logistic Regression. In this research, eight different thresholds were used to generate the ROC curves and measure the AUC (Area Under the Curve). Figure 4 indicates that, apart from the SVM classifier, all classifiers perform relatively similarly. SVM performs comparatively poorly when classifying Bag-of-Words features, while the Naïve Bayes and Logistic Regression classifiers marginally outperform the others. The AUC of each ROC curve is shown in Table 2.

One major drawback of existing solutions is their lack of scalability: existing techniques may initially produce eye-catching results, but they are unable to keep pace with the rapid growth of fraudulent websites. The proposed model is deployed on a Big Data platform to ensure that it can keep up with the vast number and rapid growth of fraudulent websites. Another major drawback of existing fraudulent website detection models is their limited data acquisition capability; recent web programming languages and technologies have made web crawling and scraping more difficult than ever before. This study tackles this issue by means of an efficient crawling engine that delivers the desired textual information regardless of the web technologies employed or deliberate limitations imposed by websites. By using an ensemble of classifiers, this study also maintains robustness against the diversity and variety of fraudulent websites, since a single classification model cannot generalize well across such a wide variety of fraudulent websites.

Fig. 4. ROC curves of the proposed fraudulent website detection model across different classifiers.

Table 2. Area Under Curve (AUC) of the proposed fraudulent website detection model across different classifiers
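For reference, ROC curves and AUC values of the kind reported in Fig. 4 and Table 2 can be computed with scikit-learn as sketched below; the classifier, the threshold handling and the toy data are illustrative, since the paper only states that eight thresholds were used.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins sized like the study's dataset (257 fraudulent, 173 legitimate).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(430, 200))
y = np.array([1] * 257 + [0] * 173)

# Cross-validated class probabilities, then ROC points and AUC for one classifier.
probs = cross_val_predict(MultinomialNB(), X, y, cv=10, method='predict_proba')[:, 1]
fpr, tpr, thresholds = roc_curve(y, probs)
print('AUC:', auc(fpr, tpr))
```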

5 Conclusion

Websites have turned into a platform for fraudsters to prey on victims and propagate fraud and cybercrime. Despite the efforts of researchers, the majority of existing measures are unable to keep pace with the rapid growth and diversity of fraudulent websites. This study addresses this issue by proposing a content-based fraudulent website detection model that utilizes the textual content of websites, natural language processing techniques and supervised classification to counter fraudulent websites. Experimental results showed that the proposed model, with a cross-validated accuracy of 97.67% and an FPR of 3.49%, achieved satisfactory results and served the aim of this research.