
1 Introduction

As the user base of Twitter steadily grows into the millions, real-time search systems and various mining tools are emerging to enable people to track events and news on Twitter. These services ease the spread of news and allow users to discuss events and post status updates, but they also open up opportunities for new forms of spam and cybercrime. Twitter has become a target for spammers to disseminate their messages. Spammers post tweets containing typical words of a trending topic together with URLs, usually obfuscated by URL shorteners, that lead users to completely unrelated websites.

Phishing is the fraudulent attempt to obtain sensitive information such as usernames, passwords, personal details and banking details, often for malicious reasons, under the guise of a trustworthy entity in an electronic community. Traditionally, phishing emails contain links that take advantage of a user's trust. Phishing sites are usually near-identical imitations of genuine ones, exploiting average users to obtain private information, generally financial details. With the increased popularity of social media networks, links to phishing sites are commonly found on these platforms. These links are often masked in shortened URLs that hide the true destination. This category of spam can jeopardise and devalue real-time search services unless an efficient and accurate automated mechanism to detect phishers is found.

Although the research community and industry have been developing techniques to identify phishing attacks through other media [6, 8-10] such as email and instant messaging, there is very little research that provides a deeper understanding of phishing in online social media. Moreover, these phishers are sophisticated and adapt to game the system with fast-evolving content and network patterns. Phishers continually change their behaviour patterns to avoid being detected, and it is challenging for existing anti-phishing systems to respond quickly to newly emerging patterns. Furthermore, relying on network connection information means that phishers must have been active over a period of time to build up connections before they can be detected. A good anti-phishing algorithm should be able to detect phishing as efficiently and as early as possible.

We propose a two-phase approach called Phishing Detector for Twitter (PDT), which combines a density-based clustering algorithm, DBSCAN, with the DeR-TIA algorithm used in detecting attacks on recommender systems. We adapted both approaches to detect phishing on Twitter. The main contributions of this paper are as follows: (1) we introduce PDT, an unsupervised phishing detection algorithm that does not rely on social influence (social network connections); (2) we describe a systematic feature selection and analysis process.

The remainder of this paper is structured as follows. Section 2 discusses various methods researchers have designed to confront the problem of detecting phishing or spam on Twitter and other media. Then in Sect. 3, we give an outline of our technique. Section 4 details the data collection methodology of our research. Section 5 discusses the selected features and analyses them using the collected samples. Section 6 details our unsupervised technique to detect phishing on Twitter. In Sect. 7, we review experimental setups and results on several data sets. Lastly, Sect. 8 concludes the paper.

2 Related Work

Phishing has been found in various traditional web applications such as emails, websites, blogs, forums and social media networks. Numerous preventative methods have been developed to fight against phishing.

List-Based Techniques. The list-based anti-phishing mechanism is a technique commonly used at low cost. Its strength comes from speed and simplicity. Classifying requires a simple lookup on the maintained database. Blacklists [14] are built into modern web browsers. A major drawback of the list-based mechanisms is that the accuracy is highly dependent on the completeness of the list. It takes time and effort to maintain the lists. Google uses automated proprietary algorithms to maintain a list of fraud websites whereas PhishTank [1] relies on contributions from online communities.
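At its core, a list-based check reduces to a set-membership lookup, which is what makes it fast and simple. The following minimal sketch illustrates this; the blacklist entries and the normalisation step are hypothetical, not taken from any real blacklist.

```python
# Minimal sketch of a list-based phishing check: classification is a
# set lookup against a maintained blacklist (entries are hypothetical).
blacklist = {
    "phish-example.test/login",
    "fake-bank.test/verify",
}

def is_blacklisted(url: str) -> bool:
    # Normalise by stripping the scheme before the lookup.
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    return url in blacklist

print(is_blacklisted("http://phish-example.test/login"))  # True
print(is_blacklisted("https://twitter.com"))              # False
```

The weakness noted above is visible here: any URL absent from the set, however malicious, passes the check.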

Machine Learning Based Techniques. With increases in computing power, phishing detection techniques involving machine learning have emerged [2]. These approaches utilise one or more characteristics found on a site and build rules to detect phishing. Pre-labelled samples play a pivotal role in building a classifier. Garera et al. [9] proposed a technique that uses the structure of a URL in conjunction with logistic regression classification and Google PageRank to determine whether a URL is legitimate or phishing.

The following two techniques employ visual cues in detection. GoldPhish by Dunlop et al. [7] implements an unusual classification approach that takes a screenshot of a target website and compares it with a genuine one to find subtle differences. In addition to the visual comparison, the classifier considers text extracted by optical character recognition from the screenshot. Zhang et al. [18] handled visual content differently. Their system first takes a screenshot of the page in question and generates a unique signature from the captured image; the image is then labelled by Visual Similarity Assessment (Earth Mover's Distance). At the same time, the system extracts textual information from the processed content. The textual features are classified using Naïve Bayes' rule and combined with the labelled image features. The classifications are evaluated by a statistical model to determine the final label. Cantina+ [16] is another feature-based approach to phishing detection. It applies machine learning techniques to features found in the DOM, search engines and third-party services. Two accompanying novel filters help reduce incorrectly labelled data and achieve runtime speedups.

Phishing Detection Techniques for Twitter. There is no denying that social media networks have become a main target for spammers due to their increasing popularity. Yardi et al. [17] studied spam on Twitter using network and temporal properties. Machine learning algorithms are incorporated in these techniques to uncover patterns exploited by spammers. In [15], the authors proposed a method that utilises graph-based and content-based features with Naïve Bayesian classifiers to distinguish suspicious behaviour from normal tweets. CAT [4] and the method proposed in [5] also use classification techniques to detect spammers. While the previously mentioned studies treat spam detection as a classification problem, Miller et al. [13] viewed it as an anomaly detection problem; two data stream algorithms, DenStream and StreamKM++, were used to facilitate spam identification. As opposed to spam detection, only a few studies have been carried out on phishing detection on Twitter. PhishAri [3] is a system that detects malicious URLs in tweets using URL-based, WHOIS-based, user-based and network-based features with a random forest classification algorithm. Warningbird [11] is another detection system, by Lee et al. Unlike conventional classifiers built on Twitter-based and URL-based features, Warningbird relies on the correlations of URL redirect chains that share the same redirection servers.

Current techniques suffer from similar limitations: (1) difficulty in detecting phishers before they have established sufficient inter-user relationships (network information); (2) lack of robustness to changes in phishing trends; (3) for list-based techniques, poor timeliness and the significant effort required to keep the lists complete.

3 Overview: Phishing Detector for Twitter (PDT)

Given the ever-changing nature of the Twitter stream and the behaviour of phishers, we postulate that unsupervised learning techniques could be a better way to detect phishing. This section outlines the unsupervised learning algorithms and the cluster classifier that are used to identify phishing tweets. We propose a new technique called Phishing Detector for Twitter (PDT). Our approach has two phases. In Phase 1 we use DBSCAN to separate legitimate and phishing tweets. However, as DBSCAN is a strict approach, some clusters contain data points that are difficult to classify as legitimate or phishing. These indeterminate tweets (data points) are then passed into Phase 2. This two-phase approach significantly increases the accuracy of phishing detection.

In the following sections we discuss the data collection methodology, alongside feature selection and the analysis we carried out. We then discuss our PDT approach in detail.

4 Data Collection Methodology

In this section, we describe the method of data collection and how each sample was labelled for this research.

Crawling Tweets. For this study, tweets were collected using the Twitter Public Streaming API. The API offers samples of public data flowing through Twitter around the world in real time. After a sample dataset was completed, tweets that did not contain URLs were removed from the dataset since they are irrelevant to the study. On October 4th, 2014, we collected 25,350 tweets and of those, 5,151 with URLs were used in this research as the initial experimental set. We later crawled additional tweets as validation sets.

Labeling Samples as Phishing or Legitimate. Despite taking an unsupervised learning approach, it was necessary to have suitable ground-truth datasets to validate our results. Thus we annotated the collected tweets as phishing or legitimate using two phishing blacklists: the Google Safe Browsing API and Twitter. If a user was suspended, we labelled that user as a spammer, and therefore all of the URLs in his tweets were marked as phishing sites. Due to delays in Twitter's phishing detection algorithm, we waited seven days before checking URLs against the databases to label the collected messages.

5 Feature Selection and Analysis

Previous studies on email phishing show that some features of URLs and emails can be used to determine whether a URL is malicious. Tweets lack some of the features that emails hold; however, these may be substituted by other metadata that only a tweet embeds.

User Features. Each Twitter user has extra information that can be utilised to identify phishing tweets along with other features. We identified seven user features and they are shown in Table 1.

Table 1. Identified User Features

Tweet Features. Malicious tweets change their attributes over time to get higher visibility in the global ecosystem. Such tweets try to gain attention by including keywords of trending topics (often by using hashtags), mentioning popular users (denoted by @) and following other active users. They include such tokens to increase their visibility to users who use Twitter's search facility or external search engines. By mentioning genuine users randomly, phishing tweets can make themselves visible to those users. Additional metadata of a tweet can be retrieved via the Twitter API. This includes, but is not limited to, retweet counts, favourite counts and a Boolean flag a user may set to indicate the presence of possibly sensitive information. We identified eight tweet features shown in Table 2.

Table 2. Identified Tweet Features

URL Features. A number of case studies on detecting phishing emails have already revealed that URL features contribute to the identification of phishing sites. Many phishing sites abuse browser redirection to bypass blacklists therefore the number of redirections between the initial URL and the final URL is another feature we collected. We have also identified extra features from WHOIS for our collected sample domains. We identified five URL features shown in Table 3.

Table 3. Identified URL Features

Feature Analysis. Prior to designing our phishing detection technique, two analyses were performed to gain better insight into the feature set and eliminate visible noise features early. In addition to the datasets we collected, we also obtained datasets from the authors of similar work, Miller et al. [13].

Table 4 presents the ranking of features based on \(\chi ^2\) values from our sample set. It can be noted that the most important features are URL count, age of account and dot count. Research on Twitter spammers by Benevenuto et al. reports similar behaviour [5].

Table 4. Results of \(\chi ^2\) computation on the sample set
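The \(\chi ^2\) ranking above can be sketched as follows. The toy data is hypothetical: one feature is constructed to correlate with the phishing/legitimate labels and one is pure noise, and each feature is binned into a contingency table before the statistic is computed.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Sketch of ranking features by chi-squared scores against the labels
# (0 = legitimate, 1 = phishing). Data below is hypothetical.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
features = {
    "url_count": labels * 2 + rng.integers(0, 2, size=200),  # correlated
    "dot_count": rng.integers(0, 4, size=200),               # noise
}

scores = {}
for name, values in features.items():
    # Contingency table: rows = feature bins, columns = class labels.
    bins = np.unique(values)
    table = np.array([[np.sum((values == b) & (labels == c))
                       for c in (0, 1)] for b in bins])
    chi2, _, _, _ = chi2_contingency(table)
    scores[name] = chi2

ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # the correlated feature ranks first
```

Features at the bottom of such a ranking are candidates for elimination as noise, which matches the purpose of the analysis above.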

6 PDT Approach

In this section we describe our PDT approach in detail. In Sect. 6.1 we discuss the DBSCAN technique which was used in Phase 1, and in Sect. 6.2 we discuss the Der-TIA approach adapted for our technique.

6.1 Phase 1: DBSCAN

Density Based Spatial Clustering of Applications with Noise (DBSCAN) is a density based data clustering algorithm. The principal idea of this algorithm is that a set of points closely packed together forms a cluster and any data points in the low density region are labelled as outliers.
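The core idea can be illustrated with a compact, quadratic-time implementation: core points (those with at least min_pts neighbours within eps) grow clusters, and points in low-density regions are left as noise. The parameter values and toy data below are illustrative only, not the settings used in PDT.

```python
import numpy as np

# A compact DBSCAN sketch with an O(n^2) neighbour search. Noise and
# unvisited points share the label -1 until a cluster absorbs them.
def dbscan(X, eps=0.5, min_pts=4):
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_pts:
            continue                 # not an unvisited core point
        labels[i] = cluster          # seed a new cluster here
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:
                    queue.extend(neighbours[j])  # expand via core points
        cluster += 1
    return labels

# Two dense blobs plus one isolated outlier.
X = np.vstack([np.zeros((10, 2)), np.ones((10, 2)) * 5, [[50.0, 50.0]]])
labels = dbscan(X, eps=1.0, min_pts=4)
print(labels)  # two clusters (0 and 1) and the outlier labelled -1
```

In practice a library implementation with an indexed neighbour search would be used; the sketch only shows the density-reachability logic.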

Fig. 1. Observed OLS patterns for pairwise features in clusters plotted on log-log charts. Red markers represent phishing samples and blue markers represent legitimate samples. The first two clusters are labelled as legitimate and the remaining clusters are labelled as phishing. (Color figure online)

Cluster Classifier. DBSCAN can only cluster data and is unable to assign clusters to phishing or non-phishing categories. To carry out the assignment, the Ordinary Least Squares (OLS) method was used, which computes a linear approximation for each pairwise feature. Distinctive linear regressions for the feature pairs were identified for all clusters. These were further analysed, and the patterns shown in Fig. 1 were observed, from which we drew the four rules described in Table 5. This was carried out on the initial set.

Table 5. Cluster classification rules (\(\beta \) is the regression coefficient in OLS method)
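The per-pair regression step can be sketched as follows. Since Table 5's actual thresholds on \(\beta \) are not reproduced here, the data and slope below are illustrative: the point is only that each pairwise feature yields a regression coefficient \(\beta \) that can then be matched against the observed patterns.

```python
import numpy as np

# Sketch of the OLS step: for a pair of features within a cluster,
# fit y = beta * x + c and report the regression coefficient beta.
def ols_beta(x, y):
    # Design matrix with an intercept column; solve the least-squares
    # problem X @ [beta, c] = y.
    X = np.column_stack([x, np.ones_like(x)])
    (beta, c), *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=100)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, size=100)  # strong positive trend
beta = ols_beta(x, y)
print(round(beta, 2))  # close to the true slope of 2.0
```

In the actual classifier this fit would be computed on the log-scaled feature pairs of each cluster, as in the log-log charts of Fig. 1.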

The minimum required cluster size is 30 data points. A cluster with fewer than 30 points is considered insufficient to exhibit any of the patterns we identified earlier, and such clusters are marked indeterminate. The second phase is necessary to filter the phishing data points from the indeterminate clusters. Additionally, a cluster's label is only decided by this procedure when at least 3 out of the 4 rule labels agree.

If the numbers of phishing and non-phishing labels are equal and no majority can be found, the cluster is also labelled indeterminate, as shown in Table 6. Data within indeterminate clusters are then passed through Phase 2.

Table 6. Cluster classification results
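The decision rule described above (minimum size of 30 points, at least 3 of the 4 rule labels agreeing) can be sketched as:

```python
from collections import Counter

# Sketch of the cluster-decision rule: a cluster needs at least 30
# data points and at least 3 of its 4 pairwise-rule labels agreeing;
# otherwise it is indeterminate and is passed to Phase 2.
def classify_cluster(size: int, rule_labels: list) -> str:
    if size < 30:
        return "indeterminate"      # too small to match any pattern
    label, votes = Counter(rule_labels).most_common(1)[0]
    return label if votes >= 3 else "indeterminate"

print(classify_cluster(120, ["phishing"] * 3 + ["legitimate"]))      # phishing
print(classify_cluster(120, ["phishing"] * 2 + ["legitimate"] * 2))  # indeterminate
print(classify_cluster(12, ["phishing"] * 4))                        # indeterminate
```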

6.2 Phase 2: DeR-TIA

To classify the indeterminate data points from the first phase, a technique called DeR-TIA [19] is used. DeR-TIA combines the results of two different measurement techniques to detect abnormalities in recommender systems.

DegSim. The first part of the algorithm finds similarities between data points using the Pearson correlation coefficient in conjunction with the k-Nearest Neighbour algorithm. The Pearson correlation coefficient yields the linear correlation (dependency) between two data points.

$$\begin{aligned} W_{pq} = \frac{\sum _{f \in F} (V_{pf} - \overline{V_p}) (V_{qf} - \overline{V_q})}{\sqrt{\sum _{f \in F} (V_{pf} - \overline{V_p})^2} \sqrt{\sum _{f \in F} (V_{qf} - \overline{V_q})^2}} \end{aligned}$$
(1)

where F is the set of all features, \(V_{pf}\) is the value of feature f in data point p and \(\overline{V_p}\) is the mean value of data point p. The outcome of the above formula ranges between \(-1\) and 1. A positive value indicates that the inputs are positively correlated, and vice versa. If no correlation is found, the outcome is 0, while the boundary values \(-1\) and 1 indicate very strong correlations. The closer the outcome is to the boundary limits, the closer the data points lie to the line of best fit.

The DegSim attribute is defined as the average Pearson correlation value over the k-Nearest Neighbours (k-NN).

$$\begin{aligned} DegSim = {{\sum _{p=1}^k W_{pq}} \over {k}} \end{aligned}$$
(2)

where \(W_{pq}\) is the Pearson correlation between data points p and q, and k is the number of neighbours.
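Equations (1) and (2) can be sketched together as follows. The toy data is hypothetical, and since the text does not fix a distance metric for the k-NN step, we assume the k "nearest" neighbours are the k most-correlated data points.

```python
import numpy as np

# Sketch of DegSim: the mean Pearson correlation (Eq. 1) between a
# data point and its k nearest neighbours (Eq. 2), where "nearest"
# is taken here to mean most correlated -- an assumption.
def pearson(p, q):
    p, q = p - p.mean(), q - q.mean()
    return (p @ q) / np.sqrt((p @ p) * (q @ q))

def degsim(V, q, k):
    # Correlations between point q and every other data point.
    w = np.array([pearson(V[q], V[p]) for p in range(len(V)) if p != q])
    top_k = np.sort(w)[-k:]          # the k most similar neighbours
    return top_k.mean()

rng = np.random.default_rng(2)
V = rng.normal(size=(20, 8))         # 20 hypothetical points, 8 features
score = degsim(V, q=0, k=5)
print(-1.0 <= score <= 1.0)  # True: a mean of correlations stays in [-1, 1]
```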

RDMA. The second part of the algorithm is the Rating Deviation from Mean Agreement (RDMA) technique, which measures a data point's deviation of agreement from the other data points over the entire dataset. The RDMA measure is as follows:

$$\begin{aligned} RDMA_p = {{\sum _{f \in F} {{|V_{pf} - \overline{V_f}|} \over {NV_f}}} \over {N_p}} \end{aligned}$$
(3)

where \(N_p\) is the number of features data point p has, F is the set of all features, \(V_{pf}\) is the value of feature f in data point p, \(\overline{V_f}\) is the mean value of feature f over the entire sample set and \({NV_f}\) is the overall number of occurrences of feature f in the sample set.
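Equation (3) can be sketched on a dense feature matrix, where the simplifying assumption is that every sample provides every feature, so \(NV_f\) is simply the sample count. The toy values below are hypothetical.

```python
import numpy as np

# Sketch of the RDMA measure (Eq. 3): the average weighted deviation
# of a data point from the per-feature means. Here every sample has
# every feature, so NV_f reduces to the number of samples.
def rdma(V, p):
    n, n_features = V.shape
    feature_means = V.mean(axis=0)
    nv = np.full(n_features, n)      # assumption: dense feature matrix
    return np.sum(np.abs(V[p] - feature_means) / nv) / n_features

V = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [9.0, 2.0]])           # the third point deviates on feature 0
print(rdma(V, 2) > rdma(V, 0))  # True: the deviant point scores higher
```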

Once the RDMA and DegSim values are computed, the dataset is split into two clusters for each measure using the k-means algorithm with k fixed at 2, to find the legitimate and phishing groups. The labels of the clusters are determined by cluster size: as discussed earlier, the majority of URLs in tweets are non-phishing, therefore the larger cluster is classified as legitimate. The phishing clusters from RDMA and DegSim are then intersected to obtain the final set of phishing data points.
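This final step can be sketched as follows. Both measures are scalar per data point, so each split is one-dimensional 2-means; the smaller cluster under each measure is called phishing, and the two phishing sets are intersected. The score values below are hypothetical.

```python
import numpy as np

# Sketch of the Phase-2 decision: 2-means each score list, label the
# minority cluster phishing, and intersect the two phishing sets.
def two_means_minority(scores):
    # Simple 1-D 2-means seeded at the extremes; iterate to stability.
    c = np.array([scores.min(), scores.max()], dtype=float)
    for _ in range(100):
        assign = np.abs(scores[:, None] - c[None, :]).argmin(axis=1)
        new_c = np.array([scores[assign == j].mean() for j in (0, 1)])
        if np.allclose(new_c, c):
            break
        c = new_c
    # The smaller cluster is labelled phishing.
    minority = np.argmin([np.sum(assign == 0), np.sum(assign == 1)])
    return set(np.flatnonzero(assign == minority))

degsim_scores = np.array([0.9, 0.8, 0.85, 0.9, 0.1, 0.15])
rdma_scores = np.array([0.1, 0.2, 0.15, 0.1, 0.9, 0.2])

phishing = two_means_minority(degsim_scores) & two_means_minority(rdma_scores)
print(sorted(phishing))  # only index 4 is flagged by both measures
```

Requiring agreement between the two measures is what filters out points that look abnormal under only one of them.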

7 Experiments

To compare the performance of our technique, we analysed it against DBSCAN. The two major reasons we chose DBSCAN as our baseline algorithm were (1) DBSCAN is an unsupervised learning technique that does not require network data features [12], and (2) DBSCAN was used in Phase 1 of our technique, so we can measure the improvement contributed by the second phase over using DBSCAN alone.

7.1 Experimental Setup

The experimental environment is a t2.micro compute instance running on the Amazon Elastic Compute Cloud platform. t2.micro instances are configured with an Intel Xeon 2.5 GHz CPU and 1 GB of RAM. The public tweet stream crawler and analysis tools for this research were written in Python with scientific libraries such as NumPy, SciPy, matplotlib and StatsModels. 15 distinct tweet sets (VS) were sampled. Table 7 summarises the collected sample sets for this research.

Table 7. Samples information

To evaluate the correctness of our experimental method on the collected feature set, commonly accepted information retrieval metrics are used: accuracy, precision, recall and f-measure quantify the effectiveness of PDT.
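The four metrics can be computed from the confusion-matrix counts as follows; the predicted and true labels below are hypothetical (1 = phishing, 0 = legitimate).

```python
# Sketch of the evaluation metrics on hypothetical labels.
def metrics(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(metrics(y_true, y_pred))  # accuracy 0.75; the other three are all 2/3
```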

7.2 Experimental Results

Using PDT we observe an average improvement of 63 % in accuracy and an average improvement of 47 % in f-measure compared with DBSCAN. Although there was no increase in precision, there was a 62 % increase in recall when comparing DBSCAN to PDT. We note that the original precision of DBSCAN was already high, with an average of 99.7 %, so any further improvement would have been marginal.

Table 8. Obtained metrics from different samples

We discovered that the accuracy of the results on the validation samples is noticeably higher than in the initial experiment. Given the dynamic behaviour of social media, it can be speculated that Twitter users' usage patterns, phishing trends, or both, changed over the 208-day period since the initial sample collection. We postulate that the improvement is due to the adaptive nature of our unsupervised learning technique, which can adjust to the continuously changing Twitter ecosystem. The samples collected over the three days show consistency in all metrics, as supported by low standard deviation values. A Mann-Whitney test on the f-measures confirms that PDT is significantly more accurate than DBSCAN, with \(U = 16\), Z-ratio \(= -4.2038\), \(P \le 0.05\) two-tailed (Table 8).
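The significance test above can be reproduced in form as follows, using SciPy's two-tailed Mann-Whitney U test. The f-measure samples below are hypothetical stand-ins, not the values from Table 8.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Sketch of the two-tailed Mann-Whitney U test comparing f-measure
# samples from the two techniques (values are hypothetical).
pdt_f = np.array([0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.94, 0.90])
dbscan_f = np.array([0.55, 0.60, 0.52, 0.58, 0.61, 0.57, 0.54, 0.59])

u_stat, p_value = mannwhitneyu(pdt_f, dbscan_f, alternative="two-sided")
print(p_value < 0.05)  # True for this toy data: the groups do not overlap
```

The Mann-Whitney test is a reasonable choice here because it is non-parametric and does not assume the f-measures are normally distributed.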

8 Conclusions

In this research we proposed PDT, a two-phase technique adapting both DBSCAN and DeR-TIA to improve coverage. PDT shows promising outcomes: it produced high accuracy, precision, recall and f-measure values compared with the baseline technique, DBSCAN. Moreover, PDT can adapt over time to behavioural changes on Twitter for phishing detection. We have shown that our technique works on data collected from different periods of time. With traditional supervised methods we could not have guaranteed such adaptation to the changing nature of Twitter data.

For further development, phishing detection through a sliding window over the stream of new tweets would be useful. A fixed-size window accepts new data from the Twitter stream and eliminates obsolete data from the window pool, ensuring the recency of the information evaluated, which may increase the accuracy of the results produced.
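The proposed fixed-size window can be sketched with a bounded deque, where appending a new tweet automatically evicts the oldest one; the window size is an illustrative choice.

```python
from collections import deque

# Sketch of a fixed-size sliding window over the tweet stream:
# new tweets push the oldest ones out, so detection always runs
# on recent data only.
window = deque(maxlen=5)

for tweet_id in range(8):            # stand-in for a live tweet stream
    window.append(tweet_id)

print(list(window))  # only the 5 most recent tweets remain
```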