1 Introduction

Computer security is one of the most important issues in the Information and Communication Technology (ICT) society. Nowadays, the Internet has become a universal communication platform, the fullest expression of the ubiquitous technology envisioned by Weiser (1991) in the late 1980s. Internet users all over the world are connected either for work or for personal use. E-mails have, by now, supplanted envelopes and stamps. People browse the latest news on their mobile phones or manage their money and personal data from a netbook while sitting in a restaurant that offers free Wi-Fi. A main driver of this pervasive use of ICT has been the adoption of commercial services, paving the way for e-commerce and multimedia services. Companies, universities, governmental organizations and critical plants (e.g., power plants) are globally connected and provide their services through the Internet. Unfortunately, the reliability of the Internet and its services is undermined by network attacks. Personal as well as business computer systems are generally at risk of being remotely compromised and misused for illegal purposes. Today, a plethora of attacks plagues computers connected to the Internet, e.g., computer worms, malware and trojan horses. The proliferation of these threats is driven by a criminal economy that rests on “business models” such as the gathering of confidential data, the disruption of services or the distribution of spam messages. While a private citizen may suffer very limited damage when he/she is the victim of a cyber attack, such attacks are a serious threat to companies and governmental organizations. The WHID 2007 report Footnote 1 by the Web Application Security Consortium revealed that 67% of the attacks in 2007 were profit motivated. There are many examples of cyber attacks in recent news. For instance, in 2009 it was revealed that the US power grid had been infiltrated by an intruder, who left behind malware capable of shutting down the entire grid. In the same year, a major spy network (GhostNet) was found to have infiltrated more than 1,000 computers around the world, with victims such as foreign ministries and embassies.

Spamming is another example of cyber attack: a user receives unsolicited mail, ranging from more or less obvious commercial messages to phishing messages specifically designed to resemble legitimate e-mails. Spam undermines the reliability of e-mail (Hoanca 2006) and has such a large economic impact that some countries have passed legislation to ban spamming (Talbot 2008). It is clear that cyber attacks can threaten national security, prompting the USA to open a Cyber Security Office Footnote 2 in the White House in 2009, followed shortly by the UK, where the Cyber Security Operations Centre Footnote 3 was founded.

There are several mechanisms that can be adopted to increase the security of computer systems: attack prevention (firewalls, user names and passwords, and user rights), attack avoidance (encryption) and attack detection (intrusion detection systems). As witnessed by a large body of security literature, machine learning (ML) (Bishop 2006) and soft computing (SC) (Zadeh 1994) methodologies offer the flexibility required to build efficient tools for computer security and the generalization capability necessary to address a wide range of security issues. This is why ML and SC techniques have been widely adopted in Computer Security and Computer Forensics (Stahl et al. 2010). ML models are typically used to address either supervised learning problems, in which a model is trained on a proper set of data, called the training set, together with the corresponding labels, called targets, or unsupervised learning problems, in which the model is trained on the training set alone, i.e., on unlabeled data (Bishop 2006). Most of the ML and SC methods applied to ICT security fall into these two categories.

This work aims to outline the trends of ML and SC methodologies for ICT security. ML and SC applications for three hot topics in ICT security, i.e., password-based schemes for access control, intrusion detection and spam filtering (SF), are overviewed.

The work is organized as follows: Sect. 2 is devoted to password-based schemes for access control, in particular password authentication, password strength, proactive password checking and password keystroke dynamics; in Sect. 3 intrusion detection systems are introduced and a taxonomy of the different applied ML and SC paradigms is provided; in Sect. 4 some aspects of spam filtering are discussed; finally, in Sect. 5 conclusions are drawn.

2 Password-based schemes for access control

A big challenge for computer scientists is the design of secure protocols for protecting partially shared or private resources. In computer security, access control provides the essential services of accounting, authentication and authorization (AAA). Even though several suitable techniques have been proposed in the literature over the years (e.g., biometric identification schemes, smart cards), password-based schemes are still frequently used due to their simplicity.

This section provides a survey of the ML and SC techniques recently proposed in the fields of password authentication, password strength, proactive password checking and password keystroke dynamics.

2.1 Password authentication

Password authentication is one of the most widely used mechanisms to authorize a legitimate user (e.g., access to accounts, PINs, files, and so on), and it does not require expensive devices. Typically, a keypad is used to record a (user ID, password) combination and save it in a table on non-volatile storage. Subsequently, this information is used to authenticate a user. The vulnerability of this scheme lies in the possibility that an intruder accesses the password information table. This insecurity can be limited by encrypting the (user ID, password) pairs prior to saving them in storage (Sibai et al. 2009). In the encrypted case the verification table need not be kept secret, because an intruder cannot recover the original passwords from what is stored in the table. Nevertheless, this technique has some shortcomings: an intruder is still able to append a forged entry to the verification table or to replace someone's encrypted password.
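
As a concrete illustration of the verification-table scheme discussed above, the following minimal sketch (not taken from the cited works; the salting and SHA-256 choices are assumptions made for illustration) stores only salted digests of the (user ID, password) pairs and checks a login attempt against them.

```python
import hashlib
import os

def make_entry(user_id: str, password: str) -> dict:
    """Create a verification-table entry holding a salted digest, not the password itself."""
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + f"{user_id}:{password}".encode()).hexdigest()
    return {"user_id": user_id, "salt": salt, "digest": digest}

def verify(table: dict, user_id: str, password: str) -> bool:
    """Authenticate by recomputing the digest and comparing it with the stored one."""
    entry = table.get(user_id)
    if entry is None:
        return False
    digest = hashlib.sha256(entry["salt"] + f"{user_id}:{password}".encode()).hexdigest()
    return digest == entry["digest"]

# Build a small verification table and test a legitimate and a forged attempt.
table = {e["user_id"]: e for e in [make_entry("alice", "S3cret!"), make_entry("bob", "hunter2")]}
print(verify(table, "alice", "S3cret!"))   # True: legitimate user
print(verify(table, "alice", "guess"))     # False: wrong password
```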

Li et al. (2001) proposed a remote password authentication scheme based on a Neural Network (NN), the Multi Layer Perceptron (MLP) (Bishop 1995; Haykin 1998). The system identifies the legitimate user in real time and is also applicable to a multiserver network architecture, where the input pattern is the user password and the output is the serviceable server. The authors compared the following supervised models: Back-Propagation NN, Sum-of-Product Network and Hybrid Sum-of-Product Network. Subsequently, Wang and Wang (2008) proposed a methodology based on a recurrent associative memory, the Hopfield NN (HNN) (Haykin 1998), which keeps all the advantages of the approach in Li et al. (2001) while eliminating the disadvantages of the MLP method, e.g., long training time and approximate recall. The overall authentication scheme incorporating the HNN can recall the information for a legal user ID and password instantly and accurately, and can be used for any access control of computing resources, such as multiple server access permission or role-based security control. Reyhani and Mahdavi (2007) trained a Radial Basis Function (RBF) NN (Haykin 1998) to store encrypted passwords; one of the advantages of RBF NNs is their faster training compared to MLP networks. However, it is worth remarking that in both Reyhani and Mahdavi (2007) and Wang and Wang (2008), during the authorization process the system must compare the NN-generated password with the one provided by the user in order to determine a match. In this way, the password can be decrypted if the key is compromised, or the encrypted password can be copied to external storage. For this reason, Sibai et al. (2008, 2009) introduced a methodology in which the comparison itself is performed by an MLP NN, without a separate comparator module: the output of the trained NN is 1 for an authorized (user ID, password) pair and 0 for access denial. Finally, Singh (2009) proposed a different methodology to design the security system, using a NN that also provides intrusion detection capability through the analysis of several logical parameters associated with user activities.
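
The following minimal sketch, not from the cited papers, conveys the general idea of training an MLP to output 1 for stored (user ID, password) pairs and 0 otherwise; the character-code encoding, the network size and the toy data are assumptions, so on such a tiny set the behavior is only indicative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def encode(user_id: str, password: str, length: int = 16) -> np.ndarray:
    """Encode a (user ID, password) pair as a fixed-length vector of normalized character codes."""
    s = (user_id + ":" + password)[:length].ljust(length, "\0")
    return np.array([ord(c) / 255.0 for c in s])

# Authorized pairs (label 1) and some negative examples (label 0), including near misses.
authorized = [("alice", "S3cret!"), ("bob", "hunter2")]
negatives = [("alice", "wrong"), ("alice", "password"), ("eve", "S3cret!"),
             ("bob", "12345"), ("bob", "letmein"), ("mallory", "admin")]
X = np.vstack([encode(u, p) for u, p in authorized + negatives])
y = np.array([1] * len(authorized) + [0] * len(negatives))

# The trained network itself acts as the comparator: 1 grants access, 0 denies it.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=5000, random_state=0).fit(X, y)
print(clf.predict([encode("alice", "S3cret!")]))  # stored pair: should be granted (1)
print(clf.predict([encode("alice", "guess")]))    # unknown pair: should be denied (0)
```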

2.2 Password strength

A password is a secret string of characters used for authentication. Password strength measures the effectiveness of a password in resisting guessing and brute-force attacks. The keys to a strong password are its length and its complexity, i.e., the number of individual characters and the number of characters that could potentially be used in its creation, respectively. Most password strength checking tools rate a password as very weak, weak, moderate (or medium), good, strong or very strong. Salem et al. (2008) presented an SC-based approach to measure the strength of passwords that may help to prevent dictionary, brute-force and shoulder surfing attacks simultaneously. The authors proposed to use Fuzzy Sets (Zadeh 1994) for each of the following three factors:

  • length of the password phrase (which should be adequately long to resist time-based cracking attacks while still being memorable);

  • dictionary based passwords;

  • entropy of the password (entropy estimates the time to crack the password Footnote 4).

In that work (see Fig. 1) each input has three membership functions (e.g., length takes the linguistic values Short, Medium and Long) and the composition is obtained by a fuzzy inference system. The output is the password strength obtained after a defuzzification module (Zadeh 1994). In a further work, Jamuna et al. (2009) introduced a Support Vector Machine (SVM) (Shawe-Taylor and Cristianini 2004) based methodology to analyze the strength of passwords. In particular, linear and nonlinear SVM classification models were trained using features extracted from a password dataset; twenty-seven descriptive features were encoded as a fixed-length vector. The performance of the model was evaluated using tenfold cross validation Footnote 5 (Hastie et al. 2001), and it was observed that the SVM classifier trained with an RBF kernel performs well (see also Suganya et al. 2010). Finally, Vijaya et al. (2009) compared several supervised ML techniques, namely the C4.5 decision tree classifier (Hastie et al. 2001), MLP, Naive Bayes classifier (Duda et al. 2000) and SVM, to learn a model for password strength prediction. The results of the models were also compared with existing password strength checking tools.

Fig. 1 Password strength obtained by using a fuzzy process with IF-THEN fuzzy rules, as presented by Salem et al. (2008)
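
The following minimal sketch illustrates the flavor of such a fuzzy strength estimator; the triangular membership functions, the two rules and the crisp score are assumptions made for illustration and do not reproduce the exact system of Salem et al. (2008).

```python
import math
import string

def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def charset_size(pw: str) -> int:
    """Size of the character pool the password appears to draw from."""
    size = 0
    if any(c.islower() for c in pw): size += 26
    if any(c.isupper() for c in pw): size += 26
    if any(c.isdigit() for c in pw): size += 10
    if any(c in string.punctuation for c in pw): size += 32
    return max(size, 1)

def strength(pw: str, dictionary: set) -> float:
    """Fuzzify length, dictionary membership and entropy, then defuzzify to a crisp score in [0, 1]."""
    length = len(pw)
    entropy = length * math.log2(charset_size(pw))    # rough bits-of-entropy estimate
    weak = max(tri(length, -1, 0, 8),                 # "length is Short"
               tri(entropy, -1, 0, 40),               # "entropy is Low"
               1.0 if pw.lower() in dictionary else 0.0)
    strong = min(tri(length, 6, 14, 100),             # "length is Long"
                 tri(entropy, 30, 80, 1000))          # "entropy is High"
    # Defuzzification: weighted average of the activations of the two rules.
    return (0.1 * weak + 0.9 * strong) / max(weak + strong, 1e-9)

words = {"password", "dragon", "letmein"}
for p in ["password", "abc123", "T7#qLm!v2Xp4"]:
    print(p, round(strength(p, words), 2))
```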

2.3 Proactive password checking

Password checking models are essential for selecting hard-to-guess passwords. In particular, the passwords chosen by users are checked for possible weaknesses and, in such cases, the user is asked to select a different password. Two different approaches to password checking exist:

  • Reactive: programs such as “cracks” are run periodically by system administrators to find weak passwords.

  • Proactive: when a user selects a password, the system immediately checks whether it is acceptable.

A disadvantage of password checking is the space required by dictionaries and the time required for checking. An interesting approach to designing a proactive password checker is the one proposed by Bergadano et al. (1998), who viewed password classification as a ML problem. In a training phase, the system acquires the knowledge needed to distinguish weak passwords from strong ones, using dictionaries of examples of both weak and strong passwords. This knowledge is represented by means of a decision tree. The same technique was subsequently applied by Blundo et al. (2004), where the previous checker is improved by using a ML technique for the construction of the decision tree, i.e., the Minimum Description Length (MDL) principle (Hastie et al. 2001). In essence, the proactive password checker prevents the choice of easy-to-guess passwords using a decision tree constructed by applying the minimum description length principle and a pessimistic pruning technique. Moreover, Ruffo and Bergadano (2005) designed a new proactive password checking system (EnFilter), composed of a set of configurable filters that use decision trees, lexical analysers, as well as Levenshtein distance (Cormen et al. 2009) based techniques. Blundo et al. (2002) put forward the possibility of using NNs for proactive password checking: instead of standard computing techniques, the classifier was implemented by means of a single perceptron (Haykin 1998). Subsequently, Ciaramella et al. (2006a) proposed to use MLP networks. The authors evaluated the performance of several network topologies and different pre-processing techniques, e.g., Principal Component Analysis (PCA) (Bishop 1995), and compared the results with RBF NN and Fuzzy Relational (FR) NN (Ciaramella et al. 2006b) approaches. A further advantage of this solution is that such checkers could easily be implemented on smart card technology, too.
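
The following minimal sketch, under assumed features (length, character-class counts, dictionary membership) and a toy dictionary, shows the overall shape of a decision-tree proactive checker; it does not reproduce the MDL-based construction or the pruning of the cited works.

```python
import string
import numpy as np
from sklearn.tree import DecisionTreeClassifier

DICTIONARY = {"password", "dragon", "letmein", "monkey", "qwerty"}

def features(pw: str) -> list:
    """Simple descriptive features of a password."""
    return [
        len(pw),
        sum(c.isdigit() for c in pw),
        sum(c.isupper() for c in pw),
        sum(c in string.punctuation for c in pw),
        int(pw.lower() in DICTIONARY),
    ]

# Toy dictionaries of weak (0) and strong (1) example passwords.
weak = ["password", "qwerty", "monkey", "abc", "1234", "dragon"]
strong = ["T7#qLm!v2Xp4", "9f$Kz@31Qw", "Vr!8nB#2sdQz", "pL0*xY7&uS"]
X = np.array([features(p) for p in weak + strong])
y = np.array([0] * len(weak) + [1] * len(strong))

checker = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The checker is queried whenever a user proposes a new password.
for candidate in ["letmein", "Gh5^rT9!bQ2m"]:
    verdict = "accepted" if checker.predict([features(candidate)])[0] == 1 else "rejected"
    print(candidate, "->", verdict)
```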

2.4 Password keystroke dynamics

When typing a phrase or a string of characters, the typing dynamics or timing pattern can be measured and used for identity verification. More specifically, a timing vector consists of the keystroke duration times interleaved with the keystroke interval times, with millisecond accuracy. Combining the keystroke dynamics identity with the password while users access computer systems is a considerably effective way to verify valid access, due to the unique keystroke dynamics of each person. Several works have used MLPs for keystroke dynamics identity. Lin (1997) and Cho et al. (2000) introduced MLP NNs to discriminate valid users and impostors according to each individual's password keystroke pattern. However, the approach had some limitations:

  1. it took too long to train the model;

  2. data were preprocessed subjectively by a human;

  3. a large data set was required.

Yu and Cho (2004) introduced a combination of approaches and models that could solve or alleviate these limitations. First of all, an SVM was proposed for novelty detection Footnote 6 in order to build a model of the owner's keystroke dynamics and use it to detect impostors through similarity measures. A wrapper feature selection approach was employed, able to automatically select a relevant subset of features and ignore the rest, thus producing better accuracy; in particular, a Genetic Algorithm (GA) (Mitchell 1996) based wrapper approach was used. Finally, an ensemble model (Hastie et al. 2001) based on feature selection (FS-Ensemble) was proposed to alleviate the deficiency of a small training data set.
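
The following minimal sketch, with synthetic timing data and assumed parameters, illustrates the novelty-detection idea: a one-class SVM is trained only on the legitimate user's timing vectors and flags dissimilar vectors as likely impostor attempts.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

def timing_vector(durations_ms, intervals_ms):
    """Interleave keystroke durations and inter-key intervals into one timing vector."""
    vec = []
    for d, i in zip(durations_ms, intervals_ms + [None]):
        vec.append(d)
        if i is not None:
            vec.append(i)
    return np.array(vec, dtype=float)

# Synthetic enrollment data: the owner types an 8-character password 30 times.
owner_durations = rng.normal(95, 8, size=(30, 8))      # key-hold times (ms)
owner_intervals = rng.normal(140, 15, size=(30, 7))    # times between keys (ms)
X_owner = np.array([timing_vector(list(d), list(i))
                    for d, i in zip(owner_durations, owner_intervals)])

# Train only on the owner's samples (novelty detection: no impostor data needed).
detector = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_owner)

# An impostor typing the same password with a different rhythm.
genuine = timing_vector(list(rng.normal(95, 8, 8)), list(rng.normal(140, 15, 7)))
impostor = timing_vector(list(rng.normal(150, 20, 8)), list(rng.normal(80, 10, 7)))

print(detector.predict([genuine]))   # +1 expected: consistent with the owner's rhythm
print(detector.predict([impostor]))  # -1 expected: flagged as an impostor
```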

Other ML approaches based on SVMs have been used to address the classification problem presented by keystroke dynamics. Sang et al. (2005), de Oliveira et al. (2005) and Sung and Cho (2006) applied SVMs to a small keystroke dataset and compared their results to standard NN technology. Further ML techniques were presented by Kang et al. (2007), where the k-means clustering algorithm (Duda et al. 2000) was used, whereas Revett et al. (2007) provided evidence that a Probabilistic Neural Network (PNN) (Haykin 1998) outperforms the MLP in terms of reduced training time and classification accuracy. Zhao (2006) compared different ML classification methods such as decision tree, Naive Bayes, Instance Based Learning, Decision Table, One Rule, Random Tree and K-star (Duda et al. 2000); compared with the conventional Nearest Neighbour method (Bishop 1995), these learning methods, especially decision trees, can be more accurate. The main objective of Killourhy and Maxion (2009), instead, was to collect a keystroke-dynamics dataset, to develop a repeatable evaluation procedure, and to measure the performance of a range of 14 detectors from the keystroke-dynamics and pattern-recognition literature. Finally, various Fuzzy Logic (Zadeh 1994) based algorithms have been proposed, too. For instance, Hussien et al. (1989) and de Ru and Eloff (1997) used a combination of fuzzy clustering algorithms (Lin and Lee 1996) depending on the number of acquired (user ID, password) samples. Haider et al. (2000) assigned ranges of typing times to fuzzy sets. Revett et al. (2005) used the Rough Sets (Pawlak 1982) induction algorithm to extract rules that form models for predicting the validity of a (user ID, password) pair.

3 Intrusion detection systems

As users, people depend on the Internet in their daily life for simple tasks, such as checking e-mails, but also for managing private and financial information. However, sending such information through the Internet also means that the network has become an increasingly appealing place for hackers. To face this threat, the research community has answered with a growing interest in intrusion detection, developing new methods to detect intruders in a timely fashion and prevent damage. Intrusion Detection Systems (IDSs) have therefore become an indispensable component of the security infrastructure, aimed at detecting these threats before they inflict unrecoverable damage.

3.1 General overview of an IDS

Following Kruegel et al. (2004), “Intrusion detection is the process of identifying and responding to malicious activities targeted at computing and network resources”. An IDS dynamically monitors the events taking place in a system and decides whether these events are symptomatic of a violation or constitute a legitimate use of the system (Wu and Banzhaf 2010). Several kinds of violations are possible, ranging from the abuse of privileges to attacks that exploit software or protocol vulnerabilities. Intrusion detection can be considered a ML problem: discriminating between network attacks and normal network behaviors or, further, distinguishing between different categories of attacks. The typical organization of an IDS is shown in Fig. 2. Intrusion detection systems can be taxonomized into two families according to the kind of input information they analyze:

  • host-based detection systems (HIDSs) analyze host-bound audit sources such as operating system audit trails and system or application logs collected from the target host machine. Since the information provided by the audit data can be extremely rich and complex, host-based approaches can achieve high detection and low false-alarm rates. Nonetheless, host-based detection has disadvantages. Firstly, it is difficult for HIDSs to prevent attacks, since when an intrusion is detected the attack has already partially occurred. Secondly, audit data may be altered by attackers, affecting their reliability.

  • network-based detection systems (NIDSs) analyze network packets that travel through switches and routers. Although such information is not as rich as the audit data of the target host machine, network-based approaches have advantages: they can detect “distributed” intrusions over the whole network and thus lighten the burden on each individual host machine for detecting intrusions. Besides, NIDSs can defend a machine against attacks, since detection occurs before the data arrive at the machine.

Fig. 2 Typical organization of an IDS: thick arrows indicate data and/or control flow, while thin arrows indicate responses to intrusive activities

Furthermore, intrusion detection can be divided into two broad categories (Hu et al. 2008), according to the detection methods employed, namely:

  • misuse detection, in which a search for the traces or patterns of well-known attacks is carried out; an alarm is generated if a previously specified pattern is recognized. Misuse detection achieves high detection rates for well-known intrusions, and the strength of a misuse-based IDS lies in being highly accurate, i.e., it rarely raises an alarm due to normal activity. On the other hand, its effectiveness depends on the completeness of the signatures; therefore, a misuse-based system fails to detect novel intrusions.

  • anomaly detection, which involves the use of a model of normal user or system behavior, usually known as the user or system profile, and highlights significant deviations from this model as potentially malicious (Patcha and Park 2007). The main advantage is the ability to detect attacks that have never been seen before (Owezarski et al. 2010). However, there are cases in which events deviating from the system profile are not necessarily malicious. The quality of an anomaly detector thus lies in its ability to detect previously unknown attacks.

It is worth observing that the IDS classes discussed so far represent only one possible categorization. In fact, many flavors of IDSs have been proposed, leading to several IDS taxonomies (Debar et al. 1999). Thus, IDSs can also be classified according to the type of analysis (“real-time” or “offline”), the type of data processing (“centralized” or “distributed”) or the response to the intrusion (“passive” or “active”). An IDS aims to discriminate between intrusion attempts and normal activities; in doing so, however, an IDS can make classification mistakes. A false positive is a normal input for which the system erroneously raises an alert. A false negative, on the other hand, is a malicious input that the IDS fails to report. The correctly classified inputs are usually referred to as true positives (attacks) and true negatives (normal traffic).
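
These four quantities are commonly summarized by the detection rate and the false-alarm rate; the following minimal sketch (synthetic labels, illustrative only) computes them from the confusion counts.

```python
def ids_rates(y_true, y_pred):
    """Compute detection rate and false-alarm rate; label 1 = attack, 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
    detection_rate = tp / (tp + fn) if tp + fn else 0.0    # fraction of attacks reported
    false_alarm_rate = fp / (fp + tn) if fp + tn else 0.0  # fraction of normal traffic flagged
    return detection_rate, false_alarm_rate

# Example with synthetic ground truth and IDS output.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(ids_rates(y_true, y_pred))  # (0.75, 0.25)
```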

3.2 ML and SC approaches to IDS modeling

ML and SC techniques have been applied to IDSs at different levels (from network-based to host-based IDSs, from misuse to anomaly detection) (Tsai et al. 2009; Wu and Banzhaf 2010). The different ML and SC techniques can be divided into four families: supervised learning-based approaches, unsupervised learning-based approaches, statistical modeling-based approaches and ensemble-based approaches.

3.2.1 Supervised learning-based approaches

In this family, mainly NN-based and SVM-based approaches have been proposed.

  • NN-based approaches: in the early development of IDSs (Wu and Banzhaf 2010), MLP NNs were applied primarily to anomaly detection at the user-behavior level (Ryan et al. 1998; Tan 1995). Tan (1995) used command sets, CPU usage and login host addresses to discriminate between normal and abnormal behaviors, whereas Ryan et al. (1998) considered the pattern of commands and their frequency. In Ghosh et al. (1998) and Ghosh and Schwartzbard (1999), instead, an MLP was built to deal with software behavior described by sequences of system calls; a leaky bucket algorithm (Kurose and Ross 2010) was used to remember anomalous events diagnosed by the network, so that the temporal characteristics of program patterns were accurately captured. Research such as Hofmann et al. (2003) and Rapaka et al. (2003) employed RBF networks to learn multiple local clusters for well-known attacks and for normal events, since RBF NNs are more suitable for problems with a large sample size (Chan et al. 2005; Wu and Banzhaf 2010). Recurrent NNs (RNNs) (e.g., the Elman model, Haykin 1998) were used for detecting attacks spread over a period of time (Al-Subaie and Zulkernine 2007; Cheng et al. 2005). Cheng et al. (2005) used an RNN to detect network anomalies exploiting the temporal locality property of the KDD99 network traffic dataset, whereas Al-Subaie and Zulkernine (2007) used an RNN classifier on the UNM system call dataset (Wu and Banzhaf 2010).

Han and Cho (2006) employed evolutionary NNs to detect intrusions. Zhang et al. (2005) proposed hierarchical RBF NNs for intrusion detection. The same problem was tackled by Toosi and Kahani (2007) using the Adaptive Neuro-Fuzzy Inference System (ANFIS) (Lin and Lee 1996) as a neuro-fuzzy classifier; the system was combined with a genetic algorithm to optimize the fuzzy decision-making engine obtained by the fuzzy inference approach.

  • SVM-based approaches: Mukkamala et al. (2002) and Sung and Mukkamala (2003) used SVMs to discriminate between normal network behaviors and intrusions and, further, to identify significant features for intrusion detection. Besides, Mill and Inoue (2004) proposed TreeSVM and ArraySVM to address the inefficiency of the sequential minimal optimization algorithm (Platt 1999) on very large training datasets in intrusion detection. To cope with the same problem, Zhang and Shen (2004) proposed an approach for online training of SVMs for real-time intrusion detection based on an improved text categorization model. Finally, Ren et al. (2009) combined SVMs and rough sets for Internet protocol based intrusion detection.
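
The following minimal sketch, using synthetic connection records with assumed feature names, illustrates the supervised setting common to the NN- and SVM-based works above: a classifier is trained on labeled traffic and then applied to new connections.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

# Synthetic connection records: [duration, bytes sent, bytes received, failed logins].
normal = np.column_stack([rng.normal(2, 1, 200), rng.normal(500, 100, 200),
                          rng.normal(1500, 300, 200), rng.poisson(0.05, 200)])
attack = np.column_stack([rng.normal(0.1, 0.05, 60), rng.normal(40, 10, 60),
                          rng.normal(40, 10, 60), rng.poisson(3, 60)])
X = np.vstack([normal, attack])
y = np.array([0] * len(normal) + [1] * len(attack))   # 0 = normal, 1 = intrusion

# RBF-kernel SVM preceded by feature scaling, as is customary for SVMs.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale")).fit(X, y)

# Classify two previously unseen connections.
print(model.predict([[1.8, 480, 1400, 0],    # looks like normal traffic
                     [0.05, 30, 35, 4]]))    # looks like an intrusion
```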

3.2.2 Unsupervised learning-based approaches

Supervised learning methods for intrusion detection can only detect known intrusions, namely the intrusions that the system has already seen and learned during training. Unsupervised learning methods can detect intrusions that have not been previously learned. Wang et al. (2006) and Hoglund et al. (2000) employed Self-Organizing Maps (SOMs) (Haykin 1998) to learn patterns of normal system activities from audit data. These approaches were enhanced by Sarasamma et al. (2005), who proposed a hierarchical SOM and exploited the SOM classification capability on properly selected dimensions of the data set to detect anomalies. An alternative approach was suggested by Jiang et al. (2006), who employed an extension of the K-means algorithm (Duda et al. 2000) to detect intrusions. Siripanwattana and Srinoy (2008) combined rough sets and fuzzy c-means (Bezdek 1981) for anomaly-based network detection, whereas Srinoy et al. (2005) experimented with Independent Component Analysis (ICA) (Hyvärinen et al. 2001) for feature selection combined with rough-fuzzy clustering. Both approaches allow known attacks to be recognized and suspicious activity, possibly resulting in a new, previously unknown attack, to be detected. In order to overcome the shortcomings of a single-level structure, which is only able to detect either misuse or anomaly attacks, Liu et al. (2007) designed a hierarchical intrusion detection model using a PCA neural network. Finally, Shon and Moon (2007) suggested a new SVM approach, named Enhanced SVM, which combines the soft-margin SVM and the one-class SVM (Shawe-Taylor and Cristianini 2004) in order to provide unsupervised learning and a low false-alarm capability in detecting anomalies, similar to that of a supervised SVM approach.
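
As a toy illustration of the clustering-based, unsupervised setting (not reproducing any specific cited system; the features and the threshold are assumptions), the following sketch learns a profile of unlabeled audit records with k-means and flags new records that lie far from every learned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Unlabeled audit records collected during normal operation: [duration, bytes out, bytes in].
profile_data = rng.normal(loc=[2.0, 500, 1500], scale=[1.0, 100, 300], size=(300, 3))
scaler = StandardScaler().fit(profile_data)
X_profile = scaler.transform(profile_data)

# Learn the normal-behavior profile with k-means (no labels are used).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_profile)

# Threshold: a distance to the nearest centroid rarely exceeded by profile data.
profile_dist = kmeans.transform(X_profile).min(axis=1)
threshold = np.percentile(profile_dist, 99)

# New traffic: a fresh normal record and a record from an unseen attack.
new_records = np.array([[2.1, 520, 1450],   # consistent with the learned profile
                        [0.05, 30, 40]])    # deviates strongly from the profile
new_dist = kmeans.transform(scaler.transform(new_records)).min(axis=1)
print(new_dist > threshold)   # expected [False, True]
```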

3.2.3 Statistical modelling-based approaches

Statistical methods were the first techniques applied to IDS research among ML and SC methods. According to this approach, normal user behavior is described in terms of what is acceptable within the system usage policies. Using various statistical modeling techniques, user behavior is monitored and, if there is any deviation from the predefined normal behavior, the anomalous activity is considered an attack. Farid and Rahman (2008) employed an adaptive Bayesian algorithm (Duda et al. 2000) to recognize different attack types on the KDD99 data set. Alternatively, Qiao et al. (2002) applied a Hidden Markov Model (HMM) (Rabiner 1989) to system calls to detect anomalous intrusions. They identified the various state transitions that a specific UNIX-based process goes through from start to end, and the HMM was applied to all the collected system calls specific to that process. Using these state transition sequences, they built a database of normal sequences and then monitored system call sequences against the database to detect anomalies. Wright et al. (2004) proposed a formalization of the traffic exchange in terms of HMM profiles to classify traffic sequences at the application level. Dainotti et al. (2008), instead, described an HMM-based packet-level model of traffic sources; a second fruitful application of this model (Dainotti et al. 2008) is the short-term prediction of future traffic behavior. Behavioral models for host-based intrusion detection were proposed by Gao et al. (2006), who profiled the normal sequence of system calls and raised alarms whenever a sequence was unlikely to be observed. Finally, Sperotto et al. (2009) applied HMMs to represent a flow time series model of an SSH brute-force attack, successfully emulating the attacker behavior.
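
To convey the flavor of these sequence-based statistical models, the following minimal sketch uses a plain first-order Markov chain over system calls rather than a full HMM (a deliberate simplification; the call names, the smoothing floor and the data are assumptions): it estimates transition probabilities from normal traces and flags low-likelihood sequences as anomalous.

```python
import math
from collections import defaultdict

def train_transitions(traces):
    """Estimate first-order transition probabilities from normal system-call traces."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[a][b] += 1
    probs = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        probs[a] = {b: c / total for b, c in nxt.items()}
    return probs

def avg_log_likelihood(trace, probs, floor=1e-4):
    """Average per-transition log-likelihood; unseen transitions get a small floor probability."""
    ll = 0.0
    for a, b in zip(trace, trace[1:]):
        ll += math.log(probs.get(a, {}).get(b, floor))
    return ll / max(len(trace) - 1, 1)

# Normal traces of a monitored process (hypothetical call names, for illustration only).
normal_traces = [
    ["open", "read", "read", "write", "close"],
    ["open", "read", "write", "close"],
    ["open", "read", "read", "read", "write", "close"],
]
model = train_transitions(normal_traces)

suspicious = ["open", "exec", "socket", "write", "close"]   # deviates from the profile
print(avg_log_likelihood(normal_traces[0], model))   # relatively high (close to 0)
print(avg_log_likelihood(suspicious, model))         # much lower: raise an alarm
```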

3.2.4 Ensemble-based approaches

As previously discussed, many ML and SC classification approaches have been applied to improve detection accuracy and to reduce false positive errors as well. Nevertheless, every approach has its strengths and weaknesses, resulting in different accuracy levels on different classes; even models built by the same algorithm show differences in misclassification. Therefore, several authors (Abraham and Jain 2005; Abraham et al. 2007; Peddabachigari et al. 2007) investigated the possibility of assembling different learning approaches to detect intrusions. In these studies, they trained and tested a decision tree model, a linear genetic program model (Mitchell 1996) and a fuzzy classifier model on the KDD99 dataset, and observed in the experiments that the different models provided complementary information about the patterns to be classified. Therefore, instead of using a single model to classify all classes, they selected the best model for each class and then combined them in order to maximize computational efficiency and detection accuracy. Techniques such as majority vote or winner-takes-all (Duda et al. 2000) are often used to decide the output of an ensemble model when the predictions of different models conflict. Gudadhe et al. (2010) studied ensembles of decision trees, combined through boosting (Schapire 1990), to recognize attacks from network data flows. A special type of decision tree, namely the decision stump, Footnote 7 was used by Hu et al. (2008): each decision stump served as a weak classifier in order to obtain a strong classifier by means of AdaBoost (Freund and Schapire 1996). In addition, Panda and Patra (2009a, b) compared AdaBoost with Random Forest (Hastie et al. 2001) and several rule-based classifiers in constructing efficient network intrusion detection models. Finally, Chan et al. (2005) used an RBF NN to merge the results from multiple classifiers.
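
The following minimal sketch (synthetic data; the feature meanings are assumptions) shows boosting of decision stumps in the spirit of the works above; scikit-learn's AdaBoostClassifier uses depth-1 decision trees as its default weak learner.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

# Synthetic labeled connections: [duration, bytes out, failed logins, distinct ports].
normal = np.column_stack([rng.normal(2, 1, 300), rng.normal(500, 120, 300),
                          rng.poisson(0.05, 300), rng.poisson(1, 300)])
attack = np.column_stack([rng.normal(0.2, 0.1, 100), rng.normal(60, 20, 100),
                          rng.poisson(2, 100), rng.poisson(15, 100)])
X = np.vstack([normal, attack])
y = np.array([0] * len(normal) + [1] * len(attack))

# AdaBoost builds a strong classifier from many decision stumps (the default base learner).
ensemble = AdaBoostClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(ensemble, X, y, cv=10)   # tenfold cross validation
print("mean accuracy: %.3f" % scores.mean())
```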

4 Spam filtering

Unsolicited e-mail messages are called spam (Hoanca 2006; Castiglione et al. 2011). Spam undermines the reliability of e-mail and has such a large economic impact that some countries have passed legislation to ban spamming (Talbot 2008). However, this legislation is often ineffective due to the difficulty of identifying the real sender of spam messages. A different approach against spamming consists in the use of spam filters (Goodman et al. 2007). After having identified a spam message, a spam filter can usually apply two different strategies: the former consists in moving the spam message to an appropriate folder containing only spam; the latter labels the message as spam, leaving to the user the decision of deleting it. Early spam filters employed user-defined rules, identified on the basis of knowledge of the regularities observed in spam messages. Nevertheless, spammers, i.e., people who send spam e-mails, generally use tools (Stern 2008) aimed at minimizing the number of spam messages that can be identified by spam filters. For instance, spammers usually employ obfuscation techniques consisting in masking the terms that are very frequent in spam e-mails, e.g., replacing “free” with “fr\(\epsilon\epsilon\)”. In this way, obfuscation techniques make the identification of a spam message by spam filters harder. At present, the former rule-based spam filters have been replaced by ML and SC based filtering tools. ML and SC techniques can extract knowledge from a limited collection of labeled messages, where the labels are either “legitimate” or “spam”, and then use the extracted knowledge to correctly classify new (i.e., not previously seen) e-mail messages. Given a collection of labeled training documents \(\mathcal{T} \subseteq \mathcal{D},\) where \(\mathcal{D}\) is the universe of all possible documents, the learning problem consists in estimating a decision function \(f: \mathcal{D} \rightarrow \{0,1\},\) where 0 corresponds to a legitimate message and 1 to spam. ML and SC applications for spam filtering can be divided into two big families: text-based spam filtering and image-based spam filtering.

4.1 Text-based spam filtering

Text-based spam filtering can be viewed as a text categorization (TC) problem (Sebastiani 2002), i.e., assigning a text to a previously defined class (or category). A typical text classification problem consists in assigning a news item to a given category, e.g., sport or politics. Although TC is a good framework for handling text-based spam filtering, spam filtering has some characteristics of its own, which should be taken into account if a ML and/or SC text-based spam filter is to guarantee an adequate degree of fidelity to the real world. The main such characteristics include (Fawcett 2003): skewed and changing class distributions, unequal and uncertain misclassification costs of spam and legitimate messages, and concept drift, namely a change in the significant terms appearing in spam messages.

Having said that, the usual structure of a text-based spam filter can now be examined. The information contained in an e-mail message can be divided into two parts: the header and the body (Guzella and Caminhas 2009). The header contains general information on the e-mail, i.e., the subject, the sender and the recipient, whereas the body contains the message. In order to use ML and/or SC methods in a text-based spam filter, some preprocessing steps have to be performed (a minimal sketch of these steps follows the list). The preprocessing steps can be divided into:

  1. tokenization, which consists in extracting the words from the body of the message and eliminating the punctuation signs, i.e., full stops, commas, parentheses, dashes and so on;

  2. stopping, which removes stop words, i.e., the most frequent words in a message. Examples of stop words are articles, conjunctions, prepositions, auxiliary and modal verbs, and adverbs;

  3. stemming, which reduces each word to its linguistic root, i.e., its stem. For instance, the stem of the word “going” is “go”. Whereas for the English language a reliable stemming algorithm (Porter's algorithm, Porter 1980) is available, for other languages, e.g., Italian, a reliable stemming algorithm does not exist; in this case the stemming process may be replaced by lemmatization, which consists in reducing the word to its lemma. Footnote 8

  4. representation, which converts the set of remaining words present in the message to a format suitable to be processed by the selected ML and/or SC algorithm.
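
The sketch below walks through this preprocessing pipeline; the tiny stop-word list and the crude suffix-stripping stemmer (standing in for Porter's algorithm) are assumptions made to keep the example self-contained.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "is", "are", "of", "in", "you"}

def tokenize(body: str) -> list:
    """Extract lowercase words from the message body, dropping punctuation signs."""
    return re.findall(r"[a-z0-9]+", body.lower())

def stop(tokens: list) -> list:
    """Remove very frequent function words (stop words)."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token: str) -> str:
    """Crude suffix stripping as a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def represent(tokens: list) -> Counter:
    """Bag-of-words representation: term -> number of occurrences in the message."""
    return Counter(tokens)

body = "You are winning free money! Click the link to claim your free prize."
tokens = [stem(t) for t in stop(tokenize(body))]
print(represent(tokens))
```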

4.1.1 Representation of text

The representation of text in TC is crucial and therefore assumes the same relevance in text-based spam filtering. A very popular representation is the so-called Bag of Words (BOW), also known as the vector-space model (Salton et al. 1975).

Given a set of terms \(\mathcal{W}= (w_1, w_2, \dots, w_N)\) chosen a priori, a document d is represented as an N-dimensional feature vector \({\mathbf x}=(x_1, x_2, \dots, x_N),\) with \(x_i=g(f(w_i)), i=1,\dots, N,\) where \(f(w_i)\) is the number of occurrences of the term \(w_i\) in the document d and \(g(\cdot)\) is an appropriate function. Another possible approach consists in using n-gram word models (Zorkadis and Karras 2006), which consider sequences of words. For instance, in the sequence "soft computing algorithms", the word 2-grams are "soft computing" and "computing algorithms". In a binary representation, a feature \(x_i\) is equal to 1 if the term \(w_i\) occurs in d, and 0 otherwise. A more popular feature representation is tf-idf, i.e., term frequency-inverse document frequency (Sebastiani 2002), where the feature \(x_i\) associated with the term \(w_i\) in the document (message) \(d \in \mathcal{T}\) is given by:

$$ x_i =n_{w_i d} \log \left(\frac{|{\mathcal{T}}|}{n_{w_i}} \right), $$

where \(n_{w_i d}\) is the number of occurrences of the term \(w_i\) in the document d and \(n_{w_i}\) is the number of training documents in \(\mathcal{T}\) in which the term \(w_i\) occurs.
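
The following minimal sketch implements this tf-idf weighting directly on a toy training collection (the messages are invented for illustration).

```python
import math
from collections import Counter

# Toy training collection T of already tokenized messages.
T = [
    ["free", "money", "click", "link"],
    ["meeting", "project", "report", "monday"],
    ["free", "prize", "click", "now"],
]

def tfidf(document, terms, collection):
    """Feature vector x with x_i = n_{w_i,d} * log(|T| / n_{w_i})."""
    counts = Counter(document)
    features = []
    for w in terms:
        n_wd = counts[w]                                     # occurrences of w in d
        n_w = sum(1 for doc in collection if w in doc)       # training documents containing w
        features.append(n_wd * math.log(len(collection) / n_w) if n_w else 0.0)
    return features

terms = ["free", "click", "meeting", "prize"]
new_message = ["free", "free", "prize", "inside", "click"]
print([round(x, 3) for x in tfidf(new_message, terms, T)])
```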

Given a set \(\hat{\mathcal{W}}\) composed of all the terms in the collection of training documents \(\mathcal{T},\) before representing the documents it is usual to apply some feature selection algorithm, which aims to identify a subset \(\mathcal{W}\subseteq \hat{\mathcal{W}}\) containing only the most representative terms. Several methodologies for TC have been proposed (Sebastiani 2002). All these methods compute a score for each term, the terms are sorted according to the score, and the top N terms are picked, where N is a value fixed a priori. The information gain (or mutual information, Zhang et al. 2004) is the most popular feature selection criterion; it assigns to each term \(w_i\) the score \(\tau(w_i)\) given by:

$$ \tau(w_i) = \sum_{c \in \{0,1\}} \sum_{t \in \{w_i, w_i^{\prime} \} } P(t,c) \log \left[\frac{P(t,c)}{P(t) P(c)}\right] $$

where \(w_i^{\prime}\) denotes the absence of \(w_i\). Finally, it is necessary to underline some weaknesses of the BOW approach, even though it is the most popular one. The most relevant weakness of the BOW approach concerns concept drift, which is a peculiar characteristic of spam. Since the structure of the feature vector is constant, i.e., it always has the same elements, a term that was not included during the model construction cannot be considered in the future (Gabrilovitch and Markovitch 2007). For instance, suppose a classifier has been trained with feature vectors containing three features related to the terms "bank", "euro" and "account". This classifier can consider neither variations of the three terms in new e-mail messages, such as "\(\epsilon\)uro", nor new terms such as "win". In order to consider a new term, the classifier needs to be retrained, which might be cumbersome if carried out frequently.
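
A minimal sketch of information-gain feature scoring on a toy labeled collection (messages invented for illustration) is shown below; it follows the formula above with binary term presence.

```python
import math

# Toy labeled collection: (tokenized message, class) with 1 = spam, 0 = legitimate.
corpus = [
    (["free", "money", "click"], 1),
    (["free", "prize", "now"], 1),
    (["project", "meeting", "monday"], 0),
    (["report", "project", "free"], 0),
]

def information_gain(term, corpus):
    """tau(w) = sum over classes c and events t in {w present, w absent} of P(t,c) log[P(t,c)/(P(t)P(c))]."""
    n = len(corpus)
    score = 0.0
    for c in (0, 1):
        for present in (True, False):
            joint = sum(1 for doc, label in corpus
                        if label == c and (term in doc) == present) / n
            p_t = sum(1 for doc, _ in corpus if (term in doc) == present) / n
            p_c = sum(1 for _, label in corpus if label == c) / n
            if joint > 0:
                score += joint * math.log(joint / (p_t * p_c))
    return score

# Rank candidate terms; higher scores indicate terms more informative about the class.
for w in ["free", "project", "now"]:
    print(w, round(information_gain(w, corpus), 3))
```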

4.1.2 Classifiers for spam filtering

The first classifier proposed for spam filtering was the naive Bayes classifier (Duda et al. 2000), introduced for this task by Sahami et al. in Technical Report WS-98-05. They tackled spam filtering in a Bayesian framework, estimating the probability that a given message is spam. The Bayesian framework allows evidence from different sources to be integrated; in this way, the authors also used non-textual features, e.g., the time when the e-mail was sent, in addition to the usual textual features. Later on, SVMs were applied to spam filtering by Drucker et al. (1999). They used the BOW representation with binary or tf-idf features, selected by means of the information gain on private corpora. Compared with boosting based on decision trees, SVMs showed a slightly higher false positive rate but more robustness across different corpora and preprocessing procedures, and required less training time. Moreover, SVMs required no feature selection procedure, since they can cope with a large number of features. The best results were obtained using a binary representation for SVMs and a frequency-based one for boosting. Clark et al. (2003) proposed LINGER, which is based on an MLP for e-mail categorization and spam filtering; messages were represented as BOW with feature selection based on the information gain. Experiments performed on public corpora showed that LINGER outperformed the naive Bayes classifier, obtaining very small false positive rates; nevertheless, when the spam filter was trained and tested on different datasets the results degraded notably. Goodman and Yih (2006) proposed a logistic regression model (Hastie et al. 2001) for SF. They used binary features, taking into account whether the feature occurs in the message header or body; the system, tested on two public corpora, obtained results competitive with the best spam filters. Oda and White (2003) proposed an Artificial Immune System (de Castro and Timmis 2002) for spam filtering; the system yielded promising results, namely true positive and true negative rates of 90 and 99%, respectively. Carreras and Marquez (2001) used a variant of AdaBoost, with decision trees as base classifiers and binary features; AdaBoost, tested on a public corpus, outperformed decision trees and the naive Bayes classifier. Finally, Zhao and Zhang (2005) applied rough sets to classify messages into three classes: spam, legitimate and suspicious. In the first step, features are selected from the training set; then a set of rules is deduced by means of a genetic algorithm, and this set of rules divides the universe of messages into three regions. Tested on a public corpus, the method outperformed the naive Bayes classifier.
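
The following minimal sketch, using a handful of invented messages, illustrates the baseline that most of the works above compare against: a bag-of-words naive Bayes spam classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled training collection (1 = spam, 0 = legitimate), invented for illustration.
messages = [
    "win free money now click the link",
    "claim your free prize today",
    "cheap meds best offer click here",
    "meeting rescheduled to monday morning",
    "please review the attached project report",
    "lunch tomorrow with the team",
]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words counts followed by a multinomial naive Bayes classifier.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB()).fit(messages, labels)

new_mail = ["free money prize click now", "project meeting report for monday"]
print(spam_filter.predict(new_mail))          # expected [1 0]
print(spam_filter.predict_proba(new_mail))    # per-class probability estimates
```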

4.2 Image-based spam filtering

Given the increasing number of spam messages containing images, spam filters must not only handle textual content but also examine image content. Aradhye et al. (2005) developed a system for identifying images characteristic of spam messages, based on the extraction of image regions containing text; it is faster than an OCR Footnote 9 since it does not need to recognize the text itself. They used a polynomial-kernel SVM fed with features, such as the percentage of the image occupied by text, selected to discriminate between images of spam messages and images (e.g., photos) included in legitimate messages. The results obtained on private corpora indicated that the system can identify 70–80% of spam messages and 70–100% of legitimate messages containing images. Wu et al. (2005) proposed an approach for analyzing visual features of images attached to e-mails; the extracted features include, for example, the number of regions containing embedded text and the number of images containing such text. Due to the difficulty of collecting legitimate e-mails with attached images, the authors employed a one-class SVM as classifier. On a test set composed of synthetically generated legitimate messages and real spam e-mails with attached images, the one-class SVM, compared with the usual two-class SVM, showed a reduced number of false positives at the cost of an increased number of false negatives. Finally, Fumera et al. (2006) proposed a system for carrying out a semantic analysis of the text embedded in the images attached to e-mails. They used an OCR to extract the text and then analyzed the content by means of a linear SVM, fed with features selected on the basis of the information gain. Experiments on a public corpus (the SpamAssassin corpus) showed that the system, using only the text extracted from the images, yielded the smallest false negative rate and low false positive rates (∼1%). The authors concluded that considering the text embedded in images can notably improve the performance of spam filters.

5 Conclusions

This work has overviewed ML and SC applications for three hot topics in ICT security, i.e., password-based schemes for access control, intrusion detection and spam filtering. The number of ML and SC techniques applied to each of these problems is summarized in Table 1. It has been shown that ML and SC have been widely applied in ICT security, because they allow a system to respond to changeable real-world inputs and to learn to identify undesirable behaviors. In this way, ML and SC are becoming increasingly important tools for Computer Security. Unlike other ML and SC applications, those in Computer Security have to work under adversarial conditions (Barreno et al. 2010), meaning, for instance, that an attacker can attempt to exploit the adaptive aspect of ML and SC systems to cause them to fail. Developing secure learning algorithms, i.e., algorithms that can work under the most disparate adversarial conditions, is a big challenge for ML and SC researchers in the near future.

Table 1 For each category of methods, the number of works related to security password-based schemes for access control (SPBS), intrusion detection systems (IDs) and spam filtering (SF) is reported