Abstract
With the prevalence of cutting-edge technology, social media networks are gaining popularity and becoming a worldwide phenomenon. Twitter is one of the most widely used social media sites, with over 500 million users around the world. Along with its rapidly growing number of users, it has also attracted unwanted users such as scammers, spammers and phishers. Research has already been conducted to counter such threats using network or contextual features with supervised learning. However, these methods are not robust to change, such as temporal changes or shifts in phishing trends. Current techniques also rely on additional network information, so they cannot be applied until spammers have formed a certain number of user relationships. We propose an unsupervised technique that detects phishing on Twitter using a two-phase unsupervised learning algorithm called PDT (Phishing Detector for Twitter). Our experiments show that the technique achieves high accuracy, ranging between 0.88 and 0.99.
1 Introduction
As the user base of Twitter steadily grows into the millions, real-time search systems and different types of mining tools are emerging to enable people to track events and news on Twitter. These services are appealing mechanisms that ease the spread of news and allow users to discuss events and post their status, but they also open up opportunities for new forms of spam and cybercrime. Twitter has become a target for spammers to disseminate their messages. Spammers post tweets containing typical words of a trending topic together with URLs, usually obfuscated by URL shorteners, that lead users to completely unrelated websites.
Phishing is the fraudulent attempt to obtain sensitive information such as usernames, passwords, personal details and banking details, often for malicious reasons, under the guise of a trustworthy entity in an electronic community. Traditionally, phishing emails contain links that take advantage of a user’s trust. Phishing sites are usually almost identical imitations of genuine ones and exploit average users to obtain private information, generally financial details. With the increased popularity of social media networks, links to phishing sites are commonly found on these platforms. These links are often masked by shortened URLs that hide the true URLs. This category of spam can jeopardise and devalue real-time search services unless an efficient and accurate automated mechanism to detect phishers is found.
Although the research community and industry have been developing techniques to identify phishing attacks through other media [6, 8–10] such as email and instant messaging, there is very little research that provides a deeper understanding of phishing in online social media. Moreover, phishers are sophisticated and adapt to game the system with fast-evolving content and network patterns. They continually change their phishing behaviour patterns to avoid being detected, which makes it challenging for existing anti-phishing systems to respond quickly to newly emerging patterns. In addition, relying on network connection information means that phishers must already have been active long enough to build up connections. A good anti-phishing detection algorithm should be able to detect phishing as efficiently and as early as possible.
We propose a two-phase approach called Phishing Detector for Twitter (PDT) that combines a density-based clustering algorithm, DBSCAN, with the DeR-TIA algorithm, which was originally used to detect attacks on recommender systems. We adapted both approaches to detect phishing on Twitter. The main contributions of this paper are as follows: (1) we introduce a phishing detection algorithm, PDT, an unsupervised learning approach that does not rely on social influence (social network connections); (2) we describe a systematic feature selection and analysis process.
The remainder of this paper is structured as follows. Section 2 discusses various methods researchers have designed to detect phishing or spam on Twitter and other media. Section 3 gives an overview of our technique. Section 4 details the data collection methodology of our research. Section 5 discusses the selected features and analyses them on the collected samples. Section 6 details our unsupervised technique to detect phishing on Twitter. In Sect. 7, we review the experimental setup and results on several data sets. Lastly, Sect. 8 concludes the paper.
2 Related Work
Phishing has been found in various traditional web applications such as emails, websites, blogs, forums and social media networks. Numerous preventative methods have been developed to fight against phishing.
List-Based Techniques. The list-based anti-phishing mechanism is a technique commonly used at low cost. Its strength comes from speed and simplicity. Classifying requires a simple lookup on the maintained database. Blacklists [14] are built into modern web browsers. A major drawback of the list-based mechanisms is that the accuracy is highly dependent on the completeness of the list. It takes time and effort to maintain the lists. Google uses automated proprietary algorithms to maintain a list of fraud websites whereas PhishTank [1] relies on contributions from online communities.
Machine Learning Based Techniques. With increases in computing power, phishing detection involving machine learning has emerged [2]. These approaches utilise one or more characteristics found on a site and build rules to detect phishing. Pre-labelled samples play a pivotal role in building a classifier. Garera et al. [9] proposed a technique that uses the structure of a URL in conjunction with logistic regression classification and Google PageRank to determine whether a URL is legitimate or phishing.
The following two techniques employ a visual cue in detection. GoldPhish by Dunlop et al. [7] implements an unusual classification approach that takes a screenshot of a target website and compares it with a genuine one to find any discrete differences. In addition to the visual comparison, the classifier considers the text extracted from the screenshot by optical character recognition. Zhang et al. [18] handled visual content differently. Their system first takes a screenshot of the page in question and generates a unique signature from the captured image; the image is then labelled by Visual Similarity Assessment (Earth Mover’s Distance). At the same time, the system extracts textual information from the processed content. The textual features are then classified using Naïve Bayes’ rule and combined with the labelled image features. The classifications are evaluated by a statistical model to determine the final label. Cantina+ [16] is another feature-based approach that detects phishing. It makes use of features found in the DOM, search engines and third-party services together with machine learning techniques. Two accompanying novel filters help reduce incorrectly labelled data and speed up the runtime.
Phishing Detection Techniques for Twitter. Social media networks have become a main target for spammers owing to their growing popularity. Yardi et al. [17] studied spam in Twitter using network and temporal properties. Machine learning algorithms are incorporated in these techniques in order to uncover patterns exploited by spammers. In [15], the author proposed a method that utilises graph-based and content-based features with Naïve Bayesian classifiers to distinguish suspicious behaviour from normal tweets. CATS [4] and the method proposed in [5] also use classification techniques to detect spammers. While the previously mentioned studies treat spam detection as a classification problem, Miller et al. [13] viewed it as an anomaly detection problem. Two data stream algorithms, DenStream and StreamKM++, were used to facilitate spam identification. In contrast to spam detection, only a few studies have been carried out on phishing detection on Twitter. PhishAri [3] is a system that detects malicious URLs in tweets using URL-based, WHOIS-based, user-based and network-based features with a random forest classification algorithm. Warningbird [11] is another detection system, by Lee et al. Unlike conventional classifiers, which are built on Twitter-based and URL-based features, Warningbird relies on the correlations of URL redirect chains that share the same redirection servers.
Current techniques suffer from similar limitations, which include (1) difficulty in detecting phishers before they have formed sufficient inter-user relationships (or network information); (2) lack of robustness to changes in phishing trends; (3) the delay and significant effort involved in keeping the lists complete in list-based techniques.
3 Overview: Phishing Detector for Twitter (PDT)
Given the ever-changing nature of the Twitter stream and the behaviour of phishers, we postulate that unsupervised learning techniques could be a better way to detect phishing. This section outlines the unsupervised learning algorithms and the cluster classifier that are used to identify phishing tweets. We propose a new technique called Phishing Detector for Twitter (PDT). Our approach has two phases. In Phase 1 we use DBSCAN to determine legitimate and phishing tweets. However, as DBSCAN is a strict approach, there are clusters with data points that are difficult to classify as legitimate or phishing. These indeterminate tweets (data points) are then passed into Phase 2. By using this two-phase approach we are able to increase the accuracy of phishing detection significantly.
In the following sections we discuss the data collection methodology, alongside the feature selection and analysis we carried out. We then discuss our PDT approach in detail.
4 Data Collection Methodology
In this section, we describe the method of data collection and how each sample was labelled for this research.
Crawling Tweets. For this study, tweets were collected using the Twitter Public Streaming API. The API offers samples of the public data flowing through Twitter around the world in real time. After a sample dataset was completed, tweets that did not contain URLs were removed from the dataset since they are irrelevant to the study. On October 4th, 2014, we collected 25,350 tweets, and of those, the 5,151 that contained URLs were used in this research as the initial experimental set. We later crawled additional tweets as validation sets.
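The URL filtering step can be illustrated with a minimal sketch. It assumes each collected tweet is a dictionary shaped like the Twitter v1.1 tweet JSON, where links appear under `entities.urls`; the function names `has_url` and `keep_tweets_with_urls` are hypothetical and not taken from the original crawler.

```python
def has_url(tweet: dict) -> bool:
    """Return True when the tweet carries at least one URL entity.

    Assumes the Twitter v1.1 tweet layout: tweet["entities"]["urls"] is a list.
    """
    entities = tweet.get("entities") or {}
    return bool(entities.get("urls"))


def keep_tweets_with_urls(tweets):
    """Drop tweets without URLs, since they are irrelevant to phishing detection."""
    return [tweet for tweet in tweets if has_url(tweet)]


sample = [
    {"text": "no link here", "entities": {"urls": []}},
    {"text": "click me", "entities": {"urls": [{"expanded_url": "http://example.com"}]}},
    {"text": "no entities at all"},
]
filtered = keep_tweets_with_urls(sample)
```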
Labeling Samples as Phishing or Legitimate. Despite having an unsupervised learning approach, it was necessary to have suitable ground truth datasets to validate our results. We therefore annotated the collected tweets as phishing or legitimate by utilising two phishing blacklists, from the Google Safe Browsing API and from Twitter. If a user was suspended, we labelled this user as a spammer, and all of the URLs in that user’s tweets were marked as phishing sites. Because of delays in Twitter’s phishing detection algorithm, we waited seven days before checking the URLs against the databases to label the collected messages.
5 Feature Selection and Analysis
Previous studies on email phishing show that some features of URLs and emails can be used to determine whether a URL is malicious. Twitter lacks some of the features that emails hold; however, they may be substituted by other metadata that only a tweet embeds.
User Features. Each Twitter user has extra information that can be utilised to identify phishing tweets along with other features. We identified seven user features and they are shown in Table 1.
Tweet Features. Malicious tweets change their attributes over time to get higher visibility in the global ecosystem. Such tweets try to gain attention by including keywords of trending topics (often by using hashtags), mentioning popular users (denoted by @) and following other active users. They include such tokens to increase their visibility to users who use Twitter’s search facility or external search engines. By mentioning genuine users randomly, phishing tweets can make themselves visible to those users. Additional metadata of a tweet can be retrieved via a Twitter API. This includes, but is not limited to, retweet counts, favourite counts and a Boolean flag a user may set to indicate the presence of possibly sensitive information. We identified eight tweet features, shown in Table 2.
URL Features. A number of case studies on detecting phishing emails have already revealed that URL features contribute to the identification of phishing sites. Many phishing sites abuse browser redirection to bypass blacklists; therefore, the number of redirections between the initial URL and the final URL is another feature we collected. We also identified extra features from WHOIS for the domains in our collected samples. We identified five URL features, shown in Table 3.
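As a rough illustration of how such URL features might be computed, the sketch below derives a dot count and a redirection count. The feature names, the use of URL length, and the idea of passing a pre-recorded redirect chain (initial URL first, final URL last) are assumptions made for this example; the exact five features are those listed in Table 3, and WHOIS lookups are omitted.

```python
from typing import Dict, List, Optional


def url_features(url: str, redirect_chain: Optional[List[str]] = None) -> Dict[str, int]:
    """Compute a few illustrative URL features (names are hypothetical).

    redirect_chain is the ordered list of URLs visited, starting with the
    initial URL and ending with the final landing URL.
    """
    chain = redirect_chain or [url]
    return {
        "url_length": len(url),              # assumed feature, not confirmed by the paper
        "dot_count": url.count("."),         # number of '.' characters in the URL
        "redirect_count": len(chain) - 1,    # hops between initial and final URL
    }


features = url_features(
    "http://bit.ly/abc",
    ["http://bit.ly/abc", "http://a.example.com/x", "http://login.example.com/index.html"],
)
```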
Feature Analysis. Prior to designing our phishing detection technique, two analyses were carried out to gain better insight into the feature set and to eliminate any visibly noisy features at an early stage. In addition to the datasets we collected, we also obtained datasets from the authors of a similar study, Miller et al. [13].
Table 4 presents the ranking of features based on \(\overline{\chi ^2}\) values from our sample set. It can be noted that the most important features are URL count, age of account and dot count. Research on Twitter spammers presented by Benevenuto et al. displays a similar behaviour [5].
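A chi-squared ranking of this kind can be sketched as follows. The paper does not state how the features were discretised or how the averaged statistic was obtained, so this example simply assumes each feature has already been binned into categories and scores it against the phishing/legitimate label with the standard Pearson chi-squared statistic; `chi2_statistic` and `rank_features` are hypothetical names.

```python
from collections import Counter


def chi2_statistic(feature_bins, labels):
    """Pearson chi-squared statistic of a binned feature against class labels."""
    n = len(labels)
    observed = Counter(zip(feature_bins, labels))
    bin_totals = Counter(feature_bins)
    label_totals = Counter(labels)
    stat = 0.0
    for b, b_total in bin_totals.items():
        for label, l_total in label_totals.items():
            expected = b_total * l_total / n
            stat += (observed[(b, label)] - expected) ** 2 / expected
    return stat


def rank_features(columns, labels):
    """Return (feature name, score) pairs sorted from most to least informative."""
    scores = {name: chi2_statistic(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


labels = [1, 1, 0, 0]
ranking = rank_features(
    {"unrelated": [1, 0, 1, 0], "aligned_with_label": [1, 1, 0, 0]},
    labels,
)
```

A feature that is independent of the label scores 0, while one that separates the classes perfectly scores the sample size.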
6 PDT Approach
In this section we describe our PDT approach in detail. In Sect. 6.1 we discuss the DBSCAN technique used in Phase 1, and in Sect. 6.2 we discuss the DeR-TIA approach adapted for our technique.
6.1 Phase 1: DBSCAN
Density Based Spatial Clustering of Applications with Noise (DBSCAN) is a density based data clustering algorithm. The principal idea of this algorithm is that a set of points closely packed together forms a cluster and any data points in the low density region are labelled as outliers.
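The principle described above can be made concrete with a minimal, unoptimised DBSCAN sketch in plain Python. It is not the implementation used in the experiments; the parameter names `eps` and `min_pts` follow the usual DBSCAN terminology, and the values in the usage example are arbitrary.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for outliers."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = noise          # low-density region (may become a border point later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point, keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In this toy input the two dense groups form separate clusters and the isolated point is marked as an outlier.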
Cluster Classifier. DBSCAN only clusters data and is unable to assign the clusters to phishing or non-phishing categories. To carry out the assignment, the Ordinary Least Squares (OLS) method was used, which computes a linear approximation for each pair of features. Distinctive linear regressions for the feature pairs were identified for all clusters. These were further analysed, and the patterns observed, shown in Fig. 1, were used to draw up the four rules described in Table 5. This was carried out on the initial set.
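The pairwise OLS fitting step can be sketched as below, assuming a cluster is represented as a mapping from feature name to the list of values of that feature within the cluster. The helper names are hypothetical, and the four rules of Table 5 that interpret the fitted lines are not reproduced here.

```python
from itertools import combinations


def ols_line(xs, ys):
    """Least-squares slope and intercept of y on x, or None if x is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # slope is undefined when x does not vary
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pairwise_fits(cluster):
    """Fit one regression line per unordered pair of features in a cluster."""
    return {(a, b): ols_line(cluster[a], cluster[b]) for a, b in combinations(cluster, 2)}


fits = pairwise_fits({"a": [0, 1, 2], "b": [1, 3, 5], "c": [2, 2, 2]})
```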
The minimum required size of a cluster is 30 data points. If a cluster contains fewer than 30 points, it is considered too small to exhibit any of the patterns we identified earlier, and it is labelled indeterminate. Otherwise, a cluster is assigned a label by this procedure when at least 3 out of the 4 rule labels agree. The second phase is necessary to filter the phishing data points out of the clusters that remain indeterminate.
If the numbers of phishing and non-phishing labels are the same, so that no majority can be found, the cluster is also labelled as indeterminate, as shown in Table 6. Data within indeterminate clusters are then passed to Phase 2.
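The decision procedure for a single cluster can be summarised in a short sketch. It assumes the four rules of Table 5 have already been evaluated and each has produced either a "phishing" or a "legitimate" label; the function and constant names are hypothetical.

```python
from collections import Counter

MIN_CLUSTER_SIZE = 30     # clusters smaller than this are not classified
MIN_AGREEING_RULES = 3    # at least 3 of the 4 rule labels must agree


def label_cluster(cluster_size, rule_labels):
    """Return 'phishing', 'legitimate' or 'indeterminate' for one cluster."""
    if cluster_size < MIN_CLUSTER_SIZE:
        return "indeterminate"
    label, votes = Counter(rule_labels).most_common(1)[0]
    return label if votes >= MIN_AGREEING_RULES else "indeterminate"


clear = label_cluster(50, ["phishing", "phishing", "phishing", "legitimate"])
tied = label_cluster(50, ["phishing", "phishing", "legitimate", "legitimate"])
small = label_cluster(10, ["phishing"] * 4)
```

Clusters that come back as indeterminate are the ones whose data points would be handed to Phase 2.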
6.2 Phase 2: DeR-TIA
In order to classify the indeterminate data points from the first phase, a technique called DeR-TIA by Zhou et al. [19] is used. DeR-TIA combines the results of two different measures to detect abnormalities in recommender systems.
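DegSim. The first part of the algorithm finds the similarities between data points using the Pearson correlation in conjunction with the k-Nearest Neighbour algorithm. The Pearson correlation yields the linear correlation (dependency) between two data points p and q:

\[ W_{pq} = \frac{\sum_{f \in F} (V_{pf} - \overline{V_p})(V_{qf} - \overline{V_q})}{\sqrt{\sum_{f \in F} (V_{pf} - \overline{V_p})^2}\,\sqrt{\sum_{f \in F} (V_{qf} - \overline{V_q})^2}} \]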
where F is the set of all features, \(V_{pf}\) is the value of feature f in the data point p and \(\overline{V_p}\) is the mean feature value of data point p. The outcomes of the above formula range between \(-1\) and 1. A positive value indicates that the inputs have a positive correlation, and a negative value indicates a negative correlation. If no correlation is found, the outcome is 0, whereas the boundary values, \(-1\) and 1, indicate very strong correlations. The closer the outcome is to the boundary values, the closer the data points lie to the line of best fit.
The DegSim attribute is defined as the average Pearson correlation value of a data point with its k-Nearest Neighbours (k-NN):

\[ DegSim_p = \frac{\sum_{q=1}^{k} W_{pq}}{k} \]
where \(W_{pq}\) is the Pearson correlation between data points p and q, and k is the number of neighbours.
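The two definitions above can be sketched in plain Python. The example assumes every data point is a list of numeric feature values of equal length, and that the k nearest neighbours of a point are the k other points with the highest Pearson correlation to it; that neighbour criterion is an assumption of this sketch, as are the function names.

```python
import math


def pearson(p, q):
    """Pearson correlation between the feature vectors of two data points."""
    mean_p = sum(p) / len(p)
    mean_q = sum(q) / len(q)
    numerator = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    denominator = math.sqrt(sum((a - mean_p) ** 2 for a in p)) * math.sqrt(
        sum((b - mean_q) ** 2 for b in q)
    )
    return numerator / denominator if denominator else 0.0


def degsim(points, index, k):
    """Average correlation of points[index] with its k most similar neighbours.

    Requires k >= 1 and at least two data points.
    """
    similarities = sorted(
        (pearson(points[index], q) for j, q in enumerate(points) if j != index),
        reverse=True,
    )
    top_k = similarities[:k]
    return sum(top_k) / len(top_k)


points = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
one_neighbour = degsim(points, 0, k=1)
two_neighbours = degsim(points, 0, k=2)
```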
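RDMA. The second part of the algorithm is the Rating Deviation from Mean Agreement (RDMA) technique. This technique measures how far a data point deviates from the agreement of the other data points across the entire dataset. The RDMA measure is as follows:

\[ RDMA_p = \frac{\sum_{f \in F} \frac{\left| V_{pf} - \overline{V_f} \right|}{NV_f}}{N_p} \]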
where \(N_p\) is the number of features that data point p has, F is the set of all features, \(V_{pf}\) is the value of feature f in the data point p, \(\overline{V_f}\) is the mean value of feature f in the entire sample set and \({NV_f}\) is the overall number of values of feature f in the sample set.
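A minimal sketch of the RDMA computation follows. It assumes each data point is a dictionary from feature name to numeric value, that a feature missing from a point is simply absent from its dictionary, and that every point has at least one feature; the function name is hypothetical.

```python
def rdma(points):
    """Rating Deviation from Mean Agreement for every data point."""
    values_per_feature = {}
    for point in points:
        for feature, value in point.items():
            values_per_feature.setdefault(feature, []).append(value)
    # Mean value and number of recorded values of each feature over the sample set.
    mean = {f: sum(vs) / len(vs) for f, vs in values_per_feature.items()}
    count = {f: len(vs) for f, vs in values_per_feature.items()}
    return [
        sum(abs(value - mean[f]) / count[f] for f, value in point.items()) / len(point)
        for point in points
    ]


scores = rdma([{"a": 0}, {"a": 0}, {"a": 3}])
```

The point whose value lies furthest from the feature mean receives the largest score.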
Once the values for RDMA and DegSim are computed, the values of each measure are split into two clusters, legitimate and phishing, using the k-means algorithm with k fixed at 2. The labels of the clusters are determined by cluster size. As discussed earlier, the majority of the URLs in tweets are non-phishing; therefore, the larger cluster is classified as legitimate. The phishing clusters from RDMA and DegSim are then intersected to obtain the final set of phishing data points.
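The final step of Phase 2 can be sketched as a one-dimensional 2-means split per measure followed by a set intersection. The initialisation at the minimum and maximum value, the iteration cap and the function names are choices made for this illustration only.

```python
def two_means_1d(values, iterations=100):
    """Assign each value to cluster 0 (low centre) or 1 (high centre) with k-means, k = 2."""
    low, high = min(values), max(values)
    assignment = [0] * len(values)
    for _ in range(iterations):
        assignment = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, a in zip(values, assignment) if a == 0]
        group1 = [v for v, a in zip(values, assignment) if a == 1]
        new_low = sum(group0) / len(group0) if group0 else low
        new_high = sum(group1) / len(group1) if group1 else high
        if (new_low, new_high) == (low, high):
            break  # centres stopped moving
        low, high = new_low, new_high
    return assignment


def smaller_cluster(values):
    """Indices of the smaller of the two clusters, treated as the phishing group."""
    assignment = two_means_1d(values)
    ones = sum(assignment)
    minority = 1 if ones <= len(assignment) - ones else 0
    return {i for i, a in enumerate(assignment) if a == minority}


def phishing_candidates(rdma_values, degsim_values):
    """Data points placed in the smaller cluster by both measures."""
    return smaller_cluster(rdma_values) & smaller_cluster(degsim_values)


flagged = phishing_candidates(
    [0.10, 0.10, 0.12, 0.90, 0.11, 0.95],
    [0.20, 0.21, 0.20, 0.80, 0.85, 0.19],
)
```

Only the data point that falls in the minority cluster for both RDMA and DegSim is kept.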
7 Experiments
To evaluate the performance of our technique, we analysed it against DBSCAN. The two major reasons we chose DBSCAN as our baseline algorithm were (1) DBSCAN is an unsupervised learning technique that does not require network data features [12], and (2) DBSCAN is used in Phase 1 of our technique, so we are able to measure the improvement contributed by our second phase compared with using DBSCAN alone.
7.1 Experimental Setup
The experimental environment was set up on a t2.micro compute instance running on the Amazon Elastic Compute Cloud platform. t2.micro instances are configured with an Intel Xeon 2.5 GHz CPU and 1 GB of RAM. The public tweet stream crawler and the analysis tools for this research are written in Python with scientific libraries such as NumPy, SciPy, matplotlib and StatsModels. 15 distinct tweet validation sets (VS) were sampled. Table 7 summarises the sample sets collected for this research.
In order to evaluate the correctness of our method on the collected feature set, commonly accepted information retrieval metrics are used: accuracy, precision, recall and f-measure quantify the effectiveness of PDT.
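For reference, the four metrics can be computed from predicted and ground-truth labels as in the sketch below, where a truthy value stands for phishing. The function name and the convention of returning 0.0 when a ratio is undefined are assumptions of this example.

```python
def evaluate(predicted, actual):
    """Accuracy, precision, recall and f-measure with phishing as the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


metrics = evaluate(predicted=[1, 1, 0, 0, 1], actual=[1, 0, 0, 1, 1])
```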
7.2 Experimental Results
By using PDT we observe an average improvement of 63% in accuracy and an average improvement of 47% in f-measure compared with DBSCAN. Although there was no increase in precision, there was a 62% increase in recall when moving from DBSCAN to PDT. We note that the original precision of DBSCAN was already high, with an average of 99.7%, so any further improvement would have been marginal.
We discovered that the accuracy of the results from the validation samples is noticeably higher than that of the initial experiment. Given the underlying dynamic behaviour of social media, it can be speculated that Twitter users’ usage patterns, phishing trends, or both changed over the 208-day period since the initial sample collection. We postulate that the higher accuracy is due to the adaptive nature of our unsupervised learning based technique, which can adapt to the continuously changing Twitter ecosystem. The samples we collected over the three days show consistency in all metrics, as supported by low standard deviation values. The Mann-Whitney test on the f-measures supports that PDT is significantly more accurate than DBSCAN, with \(U = 16\), Z-ratio \(= -4.2038\), \(P \le 0.05\) two-tailed (Table 8).
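For readers unfamiliar with the test, the sketch below computes the Mann-Whitney U statistic with midranks for ties and a normal-approximation Z value without tie or continuity correction. It is illustrative only: the input values are made up, the function names are hypothetical, and it is not claimed to reproduce the exact U and Z figures reported above.

```python
import math


def midranks(values):
    """1-based ranks of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for position in range(i, j + 1):
            ranks[order[position]] = shared_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U, Z) where U is the smaller of the two U statistics."""
    n1, n2 = len(sample_a), len(sample_b)
    ranks = midranks(list(sample_a) + list(sample_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, z


u_stat, z_value = mann_whitney_u([1, 2, 3], [4, 5, 6])
```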
8 Conclusions
In this research we proposed a technique called PDT, a two-phase approach that adapts both DBSCAN and DeR-TIA to improve coverage. PDT shows promising outcomes: it produced high accuracy, precision, recall and f-measure values compared with the baseline technique, DBSCAN. Moreover, PDT can adapt over time to behavioural changes in Twitter. We have shown that our technique works on data collected over different periods of time. With traditional supervised methods we would not have been able to guarantee adaptation to the changing nature of Twitter data.
For further development, phishing detection through a sliding window over the stream of new tweets would be useful. A fixed-size window accepts new data from the Twitter stream and evicts obsolete data from the window pool. This ensures the recency of the information evaluated and may increase the accuracy of the results produced.
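One possible shape for such a window is sketched below, using a bounded double-ended queue so that the oldest tweet is evicted automatically once the window is full. The class name and the window size are hypothetical, and this is only an illustration of the idea proposed as future work.

```python
from collections import deque


class TweetWindow:
    """Fixed-size sliding window over a stream of tweets."""

    def __init__(self, size):
        # deque with maxlen drops the oldest item when a new one is appended at capacity
        self._items = deque(maxlen=size)

    def add(self, tweet):
        self._items.append(tweet)

    def snapshot(self):
        """Current window contents, oldest first, ready to be clustered."""
        return list(self._items)


window = TweetWindow(size=3)
for tweet_id in range(5):
    window.add(tweet_id)
current = window.snapshot()
```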
References
1. PhishTank: join the fight against phishing, June 2015. https://phishtank.com
2. Abu-Nimeh, S., Nappa, D., Wang, X., Nair, S.: A comparison of machine learning techniques for phishing detection. In: Proceedings of the Anti-phishing Working Groups 2nd Annual eCrime Researchers Summit, pp. 60–69. ACM (2007)
3. Aggarwal, A., Rajadesingan, A., Kumaraguru, P.: Phishari: automatic realtime phishing detection on twitter. In: eCrime Researchers Summit (eCrime), 2012, pp. 1–12. IEEE (2012)
4. Amleshwaram, A.A., Reddy, N., Yadav, S., Gu, G., Yang, C.: Cats: characterizing automation of Twitter spammers. In: 2013 Fifth International Conference on Communication Systems and Networks (COMSNETS), pp. 1–10. IEEE (2013)
5. Benevenuto, F., Magno, G., Rodrigues, T., Almeida, V.: Detecting spammers on Twitter. In: Collaboration, Electronic Messaging, Anti-abuse and Spam Conference (CEAS), vol. 6, p. 12 (2010)
6. Chhabra, S., Aggarwal, A., Benevenuto, F., Kumaraguru, P.: Phi.sh/$ocial: the phishing landscape through short urls. In: Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-abuse and Spam Conference, pp. 92–101. ACM (2011)
7. Dunlop, M., Groat, S., Shelly, D.: Goldphish: using images for content-based phishing analysis. In: 2010 Fifth International Conference on Internet Monitoring and Protection (ICIMP), pp. 123–128. IEEE (2010)
8. Fette, I., Sadeh, N., Tomasic, A.: Learning to detect phishing emails. In: Proceedings of the 16th International Conference on World Wide Web, pp. 649–656. ACM (2007)
9. Garera, S., Provos, N., Chew, M., Rubin, A.D.: A framework for detection and measurement of phishing attacks. In: Proceedings of the 2007 ACM Workshop on Recurring Malcode, pp. 1–8. ACM (2007)
10. Klien, F., Strohmaier, M.: Short links under attack: geographical analysis of spam in a url shortener network. In: Proceedings of the 23rd ACM Conference on Hypertext and Social Media, pp. 83–88. ACM (2012)
11. Lee, S., Kim, J.: Warningbird: a near real-time detection system for suspicious urls in Twitter stream. IEEE Trans. Dependable Secure Comput. 3, 183–195 (2013)
12. Liu, G., Qiu, B., Wenyin, L.: Automatic detection of phishing target from phishing webpage. In: 20th International Conference on Pattern Recognition (ICPR), pp. 4153–4156, August 2010
13. Miller, Z., Dickinson, B., Deitrick, W., Hu, W., Wang, A.H.: Twitter spammer detection using data stream clustering. Inf. Sci. 260, 64–73 (2014)
14. Sheng, S., Wardman, B., Warner, G., Cranor, L., Hong, J., Zhang, C.: An empirical analysis of phishing blacklists. In: Sixth Conference on Email and Anti-Spam (CEAS), California, USA (2009)
15. Wang, A.H.: Don’t follow me: spam detection in Twitter. In: Proceedings of the 2010 International Conference on Security and Cryptography (SECRYPT), pp. 1–10. IEEE (2010)
16. Xiang, G., Hong, J., Rose, C.P., Cranor, L.: Cantina+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans. Inf. Syst. Secur. (TISSEC) 14(2), 21 (2011)
17. Yardi, S., Romero, D., Schoenebeck, G.: Detecting spam in a Twitter network. First Mon. 15(1) (2010)
18. Zhang, H., Liu, G., Chow, T.W., Liu, W.: Textual and visual content-based anti-phishing: a Bayesian approach. IEEE Trans. Neural Netw. 22(10), 1532–1546 (2011)
19. Zhou, W., Wen, J., Koh, Y.S., Alam, S., Dobbie, G.: Attack detection in recommender systems based on target item analysis. In: 2014 International Joint Conference on Neural Networks (IJCNN), pp. 332–339. IEEE (2014)
© 2016 Springer International Publishing Switzerland
Jeong, S.Y., Koh, Y.S., Dobbie, G. (2016). Phishing Detection on Twitter Streams. In: Cao, H., Li, J., Wang, R. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science(), vol 9794. Springer, Cham. https://doi.org/10.1007/978-3-319-42996-0_12
DOI: https://doi.org/10.1007/978-3-319-42996-0_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42995-3
Online ISBN: 978-3-319-42996-0
eBook Packages: Computer Science (R0)