1 Introduction

A useful resource for preventing risks in computer security is provided by so-called blacklists, which are databases containing a typically large number of IP addresses, domain names and related URLs of sites considered suspicious in terms of threat generation. A rich literature is available on the creation and usage of such lists (see, e.g., [17]). If a URL is included in a blacklist, it is advisable to divert network traffic away from it and, in fact, many internet service providers (ISPs) simply block all messages coming from it. Users who detect anomalies in messages, or activities they consider suspicious, often forward the related information to appropriate Web sites devoted to risk analysis.

Another way to compile blacklists is the use of spam trap addresses, which are disseminated with the aim of being contacted by the crawler spiders typically used by phishers. As soon as one of these addresses is contacted, the calling site is added to the blacklist.

Although blacklists are rather useful, we cannot expect them to be exhaustive of all possible threats, both because the number of potentially dangerous sites is extremely high and because the system is highly dynamic, making it almost impossible to keep any blacklist sufficiently up to date.

Whenever a site is suspected of being malicious, the Whois service can provide useful information about it in terms of IP address, domain name and other related characteristics. Whois registers are publicly available, and online services exist that provide such information upon request, in an appropriate form.

The basic idea of [10] is to use the information available about a given set of URLs, in connection with the related Whois data, to design a classifier based on some machine learning technique. In particular, one of the tools adopted is the well-known SVM paradigm which, being suited to supervised classification, requires that a sufficiently rich training set be available in advance. Such a training set consists of a list of URLs labeled as malicious or nonmalicious. A set of both qualitative and quantitative features is defined, and each URL is associated with a string of possible feature values.

Of course, different sets of features can be adopted for classification purposes; in the next section we describe in detail the ones we have considered.

The differences of our approach w.r.t. the one previously cited [10] are twofold. Since we aim to provide a methodology suitable for very large datasets as well, we have confined ourselves to a limited number of features (e.g., we have used no “bag of words,” which in general causes an explosion in the size of the sample space) and, on the other hand, we have adopted a low-complexity algorithm, accepting in advance the possibility of less accurate classification performance w.r.t. the SVM approach, which requires the solution of a structured quadratic programming problem [12].

Following [10], our classification model takes into account both lexical features of the URL and host information, such as that provided by Whois. As a classification tool, we adopt the spherical separation paradigm [3, 4], which differs from SVM basically in that separation in the feature space is pursued not by means of a hyperplane but by a spherical surface (the application of ellipsoidal surfaces has been introduced as well in [1]). Of course, the effectiveness of the classification tool depends on the geometry of the sets to be separated and cannot be easily predicted. Our choice of spherical separation has been dictated by the availability of a simplified approach to it [2], where the center of the separating sphere is calculated in advance and only the radius is optimized. Such an approach allows us to calculate the classifier in \(O(t \log t)\), where t is the size of the training set, and consequently appears suitable for dealing with large-scale applications.

The paper is organized as follows. In Sect. 2 we introduce our classification model, describing in detail the different sets of features we have adopted. In Sect. 3 we summarize the spherical separation algorithm we have implemented. In Sect. 4 we report the results of our computational experience on some benchmark training sets drawn from the literature, together with some conclusions.

2 The classification model

We now present the list of seven features that we used in our work to detect malicious URLs. For our purposes, we decided not to analyze the URL’s page structure or its content: all the features have been generated using information derived from the general URL syntax (every URL consists of the following three parts: \(\texttt{<protocol>://<hostname>/<path>}\)). Some of the features are strictly related to the lexical properties of the URL (intended as a character string) and others to the properties of the URL’s hostname (available thanks to a Whois query).

The features taken into consideration are the following (a minimal extraction sketch in Python is given after the list):

1. Number of subdomains. The count of the subdomains detectable in the text of the URL. We consider not only the URL’s hostname but also the URL’s path, which is often used to redirect users to other dangerous Web sites.

2. URL age. The age in days of the URL’s hostname (how long its content has been on the web). This is one of the most important features in the malicious URL detection problem, since dangerous sites usually have a very short life and are more recently registered than safe sites. The registration date of the URL’s hostname is needed to calculate the value of this feature.

3. URL expiration. The number of days remaining before the expiration date of the URL’s hostname. This feature, too, is potentially useful, since dangerous sites are usually registered for a shorter time than safe ones.

4. Hostname length. Simply the count of the textual characters forming the URL’s hostname.

5. Path length. Simply the count of the textual characters forming the URL’s path. Excessive length is often correlated with suspicious redirection.

6. IP address geographical location. IP addresses related to dangerous sites are usually located in specific geographical areas, so we use this feature to express the country location of the IP address related to the URL’s hostname. The list of countries used to evaluate this parameter has been reduced to just ten nations: Canada, China, Finland, France, India, Italy, Spain, Turkey, UK and USA; all other countries have been bundled together into a single catch-all category. This is a categorical feature.

7. Presence of the word “Login”. Dangerous sites usually contain specific terms in the text of their web address in order to deceive web users. Thus, we have included this binary feature to report the possible presence of the word “Login” in the text of the URL.
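Taken together, the features can be computed as in the following minimal Python sketch. It is illustrative only: the subdomain count is crudely approximated by counting dots, and the Whois record layout (fields creation_date, expiration_date, country) is our assumption, not a specification given in the paper.

```python
from urllib.parse import urlparse
from datetime import datetime

TEN_COUNTRIES = {"CA", "CN", "FI", "FR", "IN", "IT", "ES", "TR", "UK", "US"}

def extract_features(url, whois_record):
    """Sketch of the seven features; whois_record is assumed to be a dict
    with 'creation_date', 'expiration_date' (datetime) and 'country'."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host, path = parsed.hostname or "", parsed.path or ""
    today = datetime.now()
    return {
        # 1. subdomains, counted (crudely) as dots over hostname and path
        "n_subdomains": host.count(".") + path.count("."),
        # 2. age in days of the hostname
        "url_age": (today - whois_record["creation_date"]).days,
        # 3. days remaining before the hostname expires
        "url_expiration": (whois_record["expiration_date"] - today).days,
        # 4.-5. lexical lengths of hostname and path
        "hostname_length": len(host),
        "path_length": len(path),
        # 6. categorical: one of the ten listed countries, or a catch-all
        "country": whois_record["country"]
                   if whois_record["country"] in TEN_COUNTRIES else "OTHER",
        # 7. binary: presence of the word "Login" in the URL text
        "has_login": int("login" in url.lower()),
    }
```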

According to their definitions, all the features taken into consideration (except “IP address geographical location”) assume integer values greater than or equal to zero. As the number of features we have adopted is small, we have performed several experiments using different sets of features; according to the set used, the URL is represented by a vector of different size. When working with different sets of features, it is important to detect the relevant ones and to discard those unable to provide any significant contribution to the classification process.

The malicious URL detection problem has been modeled as a binary classification problem with two predefined classes of URLs: the set of malicious URLs (those dangerous for web users) and the set of benign URLs (those safe for web users).

3 Spherical separation

In many supervised machine learning problems the objective is to assign elements to a finite set of classes or categories. Given a set of sample points coming from two classes, we want to construct a function discriminating between the classes, with the goal of selecting one that will efficiently and correctly classify future points. Classification techniques of this kind are used in data mining and pattern recognition, where many applications require a categorization.

The classical binary classification problem is to discriminate between two finite sets of points in the n-dimensional space, by a separating surface. The problem consists in finding a separating surface minimizing an appropriate measure of the classification error.

Several mathematical programming-based approaches for binary classification have been historically proposed [5, 11, 15]. Among the more recent ones we recall the support vector machine (SVM) technique [6, 16], where a classifier is constructed by generating a hyperplane far away from the points of the two sets. By adopting kernel transformations within the SVM approach, we can obtain general nonlinear separation surfaces. In this case the basic idea is to map the data into a higher-dimensional space (the feature space) and to separate the two transformed sets by means of one hyperplane that corresponds to a nonlinear surface in the original input space.

Parallel to the development of SVM methods, the use of nonlinear separating surfaces in the dataset space, instead of hyperplanes, has received some attention in recent years. In particular, in our work we have considered the spherical separation approach, characterized by the fact that the separation process takes place in the original input space and does not require mapping to higher-dimensional spaces.

More formally, let

$$\begin{aligned} {\mathcal{X}} = \{x_1,\ldots ,x_p\} \end{aligned}$$

be a set of samples (or points) \(x_i \in \mathbb{R}^n\). In supervised learning we assume that a label \(y_i\) is given in correspondence to each point \(x_i\) of \({\mathcal{X}}\). The case \(y_i\in \mathbb{R}\) is known as “regression,” while, when the label \(y_i\) takes values in a discrete finite set, the task is a “classification” process. A particular case of the latter is binary classification, where, for each i, the label \(y_i\) can assume only two possible values. The objective of supervised learning is to predict the label of any new sample only on the basis of the information carried by the labeled points (the training set).

In particular, in the binary classification problems, we consider the following partition of \({\mathcal{X}}\) into two nonempty sets:

$$\begin{aligned} {\mathcal{X}}_+:=\{(x_i,y_i) \mid x_i \in \mathbb{R}^n, ~y_i=+1, \quad i=1,\ldots ,m\} \end{aligned}$$

and

$$\begin{aligned} {\mathcal{X}}_-:=\{(x_i,y_i) \mid x_i \in \mathbb{R}^n, ~y_i=-1, \quad i=m+1,\ldots ,p\}. \end{aligned}$$

In the spherical separation approach, the set \({\mathcal{X}}_+\) is said to be spherically separable from \({\mathcal{X}}_-\) if and only if there exists a sphere

$$\begin{aligned} S(x_0,R)\ :=\{x \in {\mathbb{R}}^n|(x-x_0)^T(x-x_0)\le R^2\} \end{aligned}$$

centered at \(x_0\in \mathbb{R}^n\) and of radius \(R\in \mathbb{R}\), such that

$$\begin{aligned} y_i(\Vert x_i-x_0\Vert ^2-R^2)\le 0 \quad i=1,\ldots ,p. \end{aligned}$$
(3.1)

In addition, when inequalities (3.1) are strictly satisfied, the sets \({\mathcal{X}}_+\) and \({\mathcal{X}}_-\) are said to be strictly spherically separated, i.e.,

$$\begin{aligned} \Vert x_i-x_0\Vert ^2&\le (R-M)^2 \quad i=1,\ldots ,m,\\ \Vert x_i-x_0\Vert ^2&\ge (R+M)^2 \quad i=m+1,\ldots ,p, \end{aligned}$$
(3.2)

for some M, the margin, such that \(0<M\le R\). Setting \(q :=2RM\) and \(r :=R^2+M^2\), inequalities (3.2) become:

$$\begin{aligned} q+y_i(\Vert x_i-x_0\Vert ^2-r)\le 0 \quad i =1,\ldots ,p. \end{aligned}$$
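A one-line check confirms that this compact form is equivalent to (3.2): substituting \(q=2RM\) and \(r=R^2+M^2\) back into the inequality gives

$$\begin{aligned} y_i=+1:&\quad \Vert x_i-x_0\Vert ^2\le r-q=R^2+M^2-2RM=(R-M)^2,\\ y_i=-1:&\quad \Vert x_i-x_0\Vert ^2\ge r+q=R^2+M^2+2RM=(R+M)^2. \end{aligned}$$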

In general it is not easy to know in advance whether the two sets are strictly spherically separable; hence, in [2–4, 9] a classification error function has been defined in order to find a minimal-error separating sphere.

In particular, in [2, 3] a separating sphere has been obtained by minimizing the following objective function:

$$\begin{aligned} \displaystyle z_{S_1}(x_0, r)=r+C \displaystyle \sum _{i=1}^p\max \{0,y_i(\Vert x_i-x_0\Vert ^2-r)\}, \end{aligned}$$

with \(M=0\) and \(r\ge 0\). C is a positive parameter giving the trade-off between the minimization of the radius and the minimization of the classification error.

More precisely, in [2] an ad hoc algorithm finding the optimal solution in \(O(t \log t)\), with \(t=\max \{m, p-m\}\), has been presented for the case where the center \(x_0\) is fixed. It basically consists of two phases: a sorting phase, in which the sample points are sorted according to their distance from the center, and a cutting phase, in which an optimal cut is found. The adopted simplification is rather drastic; nevertheless, a judicious choice of the center (e.g., the barycenter of the set \({\mathcal{X}}_+\)) has allowed us to obtain reasonably good separation results at a very low computational cost.
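To make the two-phase scheme concrete, the following NumPy sketch implements our reading of the fixed-center training step; the function names and the convention that the positive class lies inside the sphere are ours, not the authors’. Once \(x_0\) is fixed, \(z_{S_1}\) is piecewise linear and convex in r, so the optimum lies at \(r=0\) or at one of the squared distances \(d_i=\Vert x_i-x_0\Vert ^2\); running sums over the sorted distances then evaluate all breakpoints in linear time after the \(O(t\log t)\) sort.

```python
import numpy as np

def train_fixed_center(X, y, x0, C=1.0):
    """Minimize z(r) = r + C * sum_i max(0, y_i * (||x_i - x0||^2 - r))
    over r >= 0, with the center x0 fixed (a sketch, not the authors' code)."""
    d = np.sum((X - x0) ** 2, axis=1)        # squared distances from x0
    order = np.argsort(d)                    # sorting phase: O(p log p)
    d, y = d[order], y[order]
    pos, neg = (y == 1), (y == -1)
    cs_pos_n, cs_pos_d = np.cumsum(pos), np.cumsum(d * pos)
    cs_neg_n, cs_neg_d = np.cumsum(neg), np.cumsum(d * neg)
    # At breakpoint r = d[k]: positives left outside contribute d_i - r,
    # negatives caught inside contribute r - d_i.
    e_pos = (cs_pos_d[-1] - cs_pos_d) - d * (cs_pos_n[-1] - cs_pos_n)
    e_neg = d * cs_neg_n - cs_neg_d
    z = d + C * (e_pos + e_neg)              # objective at every breakpoint
    z0 = C * cs_pos_d[-1]                    # objective at r = 0
    k = int(np.argmin(z))                    # cutting phase: best breakpoint
    return 0.0 if z0 < z[k] else float(d[k])

def predict(X, x0, r):
    """Label +1 (inside the sphere) or -1 (outside)."""
    return np.where(np.sum((X - x0) ** 2, axis=1) <= r, 1, -1)
```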

In [3] the spherical separation problem has been tackled without any constraint on the location of the center by means of the DC algorithm (DCA) [8, 13], based on a difference-of-convex (DC) decomposition of the objective function. A similar DCA approach has been proposed in [9] for minimizing the following error function:

$$\begin{aligned} \displaystyle z_{S_2}(x_0, r)=r^2+C \displaystyle \sum _{i=1}^p\max \{0,y_i(\Vert x_i-x_0\Vert ^2-r)\}^2, \end{aligned}$$

with \(M=0\). The choice of \(z_{S_2}\) instead of \(z_{S_1}\) is motivated by the fact that using \(z_{S_2}\) allows a DC decomposition where all the computations in DCA are explicit.

Finally, the DCA scheme has also been used in [4] for minimizing the following error function:

$$\begin{aligned} \displaystyle z_{S_3}(x_0, q, r)=C \displaystyle \sum _{i=1}^p\max \{0,y_i(\Vert x_i-x_0\Vert ^2-r)+q\}-q, \end{aligned}$$

with \(0\le q \le r\) and \(M\ge 0\), where the term \(-q\) is aimed at maximizing the margin. Moreover, similarly to [2], an ad hoc algorithm that finds the optimal solution of \(z_{S_3}\) in \(O(t \log t)\) when the center \(x_0\) is fixed has also been designed in [4].

4 Computational experience

We now present the results of our numerical experiments in malicious URL detection. We have adopted the spherical classification approach described in Sect. 3; in particular, we have coded in MATLAB (on a Pentium V 2.60 GHz notebook) the algorithm introduced in [2], which finds the optimal solution when the center of the separating sphere is fixed in advance.

To evaluate the effectiveness of our approach we have used a dataset obtained as follows. The samples have been randomly selected from some sources available on the web. For the malicious URLs we used the online blacklist PhishTank [14], a free community site where anyone can submit, verify, track and share phishing data. For the benign URLs we collected our samples by developing a fast web crawler (a program that visits web pages and collects information) capable of gathering unique (non-duplicated) URLs of specified minimum and maximum length. The crawling phase started from the DMOZ Open Directory Project [7], a human-edited directory of safe web pages. The size of our dataset is 245 malicious and 384 benign URLs (dataset 1).

We have applied the standard tenfold cross-validation protocol: the dataset has been divided into ten subsets of equal size, and each subset has been validated using the classifier obtained by training on the remaining nine subsets. The classification accuracy is defined as the average percentage of well-classified URLs (of both classes).

We have run the program for different values of the positive weighting parameter C. We have implemented two possible choices of the center \(x_0\), namely the barycenter of the set of malicious URLs (\(x_0^{(1)}\)) and the barycenter of the set of benign URLs (\(x_0^{(2)}\)). Our evaluation showed that setting the center \(x_0\) equal to the barycenter of the set of malicious URLs was the best-performing choice.
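For concreteness, here is a minimal sketch of this evaluation protocol, reusing the train_fixed_center and predict functions sketched in Sect. 3; the feature matrix X (one row per URL), the ±1 label vector y and the malicious-barycenter center choice follow the conventions assumed there.

```python
import numpy as np

def tenfold_accuracy(X, y, C=1.0, seed=0):
    """Tenfold cross-validation with x0 fixed at the barycenter of the
    malicious (+1) training points; returns the average accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), 10)
    accuracies = []
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        x0 = X[train][y[train] == 1].mean(axis=0)  # barycenter of malicious set
        r = train_fixed_center(X[train], y[train], x0, C)
        accuracies.append(np.mean(predict(X[test], x0, r) == y[test]))
    return float(np.mean(accuracies))
```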

We first executed an exhaustive feature selection process on five of the seven features analyzed in Sect. 2: number of subdomains, URL age, URL expiration, hostname length and path length (Table 1); a sketch of the enumeration is given below. In Table 2 we report the most significant results obtained and the corresponding sets of features. In Table 3 we report the true positive and the true negative rate for the best-performing combination of features, where the true positive rate is defined as the percentage of benign URLs correctly classified as benign and the true negative rate as the percentage of malicious URLs correctly classified as malicious.

Table 1 Features description
Table 2 Average correctness on dataset 1
Table 3 Average true positive and true negative rate on dataset 1
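The exhaustive search is cheap here, since five candidate features yield only 31 nonempty subsets. A minimal sketch, reusing the hypothetical tenfold_accuracy helper above:

```python
from itertools import combinations

def exhaustive_selection(X, y, feature_names, C=1.0):
    """Score every nonempty subset of the candidate features
    (31 subsets for five features) with the tenfold protocol."""
    scores = {}
    for k in range(1, len(feature_names) + 1):
        for subset in combinations(range(len(feature_names)), k):
            names = tuple(feature_names[i] for i in subset)
            scores[names] = tenfold_accuracy(X[:, list(subset)], y, C)
    return sorted(scores.items(), key=lambda item: -item[1])
```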

The results of the feature selection process indicate that the best classification accuracy is obtained by considering a rather small subset of the available features, in particular the combination of URL age and URL expiration. The remaining features are redundant and, sometimes, even misleading.

We have extended the numerical experiments by applying our methodology to a bigger dataset of 11,975 URLs, 5090 malicious and 6885 benign (dataset 2), constructed with the same methodology as dataset 1. The results are shown in Tables 4 and 5; also in this case we report the subsets of features which have provided the best results.

Table 4 Average correctness on dataset 2
Table 5 Average true positive and true negative rate on dataset 2

We have performed some additional experiments on both datasets by considering the two remaining features: the IP address geographical location and the presence of the “suspicious” word Login in the text of the URL (Table 6). Since such information was not available for all samples, the size of the datasets has been reduced to 370 benign and 210 malicious URLs for dataset 1, and to 6432 benign and 4520 malicious URLs for dataset 2.

As for dataset 1, in Table 7 we report the most significant results obtained in this series of experiments, and in Table 8 the true positive and true negative rates for the best-performing combination of features.

Table 6 Features description
Table 7 Additional experiments: average correctness on dataset 1
Table 8 Additional experiments: average true positive and true negative rate on dataset 1

The above results show that the two additional features did not provide any positive contribution in terms of classification accuracy. We remark on the prominent role played by the URL age in this setting too.

As far as dataset 2 is concerned, the results are reported in Tables 9 and 10. The introduction of the additional features has provided, in this case, a slight improvement.

Finally, we observe that our methodology has provided rather satisfactory classification quality. It is worth noting that, even for the larger dataset 2, the computation time has always been of the order of a few seconds, which suggests the possible use of our method, at least as a preliminary classification tool, in future very large-scale applications.

Table 9 Additional experiments: average correctness on dataset 2
Table 10 Additional experiments: average true positive and true negative rate on dataset 2