1 Introduction

Over the last decade, with the rapid growth of information technology, security has become a major concern for almost all institutions and companies. Intruders continuously attempt to gain access to network traffic, and security systems therefore have to cope with many different types of attacks [1]. Secure data transfer over a network has always been a challenging task, and in domains such as banking transactions, defense-related communication, and government network traffic, the security of network traffic becomes even more critical. Consequently, it is imperative to have a secure network monitoring and detection system that can efficiently detect different types of attacks and thus protect a network against malicious activity. An intrusion detection system (IDS) monitors network traffic, detects violations of the security policy, and recognizes abnormal patterns [2]. A host-based IDS monitors only information related to an individual host [3]. A network-based IDS, instead of monitoring individual hosts, monitors and analyzes the overall network flow [4]. An application-based IDS, in turn, monitors abnormal protocol behavior between different devices [5].

An IDS can be considered an expert system. It is first trained on a large dataset, during which discriminative features are selected; after the model has learned to recognize patterns as normal or malicious, its performance is tested on unseen data. The performance of an IDS depends largely on the selected features. Intrusion detection techniques are broadly categorized into two main groups. The first is anomaly-based IDS, which detects data patterns that show abnormal behavior [6]. The advantage of anomaly-based IDS is that zero-day attacks can be detected; however, it may suffer from a high false positive rate. The second category is signature-based (misuse-based) IDS, in which attack patterns are known in advance. For example, a labeled intrusion detection dataset contains instances marked as normal or intrusive, and a model trained with machine learning algorithms can automatically detect intrusive instances [7]. However, zero-day attacks are difficult for a signature-based IDS to detect.

In the past, many different intrusion detection techniques have been reported. Malhotra et al. [8] used K-nearest neighbor and genetic programming to develop a reliable IDS. To overcome the deficiencies of IDSs based on a single-level structure, a hierarchical approach based on neural networks was recommended by Zhan et al. [9]. Panda et al. [10] used a naïve Bayes classifier to develop an efficient IDS, while Zhang et al. [11] used a random forest-based classifier to detect intrusions within the network. Portnoy et al. [12] proposed a clustering-based technique that uses unlabeled data to detect intrusions; similarly, an IDS trained in an unsupervised way was reported by Leung et al. [13]. Aslahi et al. [14] reported a hybrid approach to intrusion detection. Another network-based IDS was developed by Ramakrishnan et al. [15], in which features are selected on the basis of entropy. Similarly, a technique that detects intrusions in wireless ad hoc networks was proposed by Srinivasan et al. [16], whereby a Kohonen self-organizing map identifies abnormal patterns within the network. In contrast, Puri et al. [17] proposed a hybrid technique that combines classification and regression trees with a support vector machine (SVM) to identify network attacks. In a different work, SVM- and neural network-based IDSs were developed by Mukkamala et al. [18] to ensure network security.

Recently, a fuzzy logic-based semi-supervised algorithm has been proposed for intrusion detection, in which a single-layer neural network is used to obtain the fuzzy membership function [19]. Similarly, Kim et al. [20] used a deep neural network-based approach for IDS, and another tree algorithm-based approach for network intrusion detection was reported by Subasi et al. [21]. Almost all of these intrusion detection techniques use machine learning algorithms that must be trained from scratch. In contrast, the proposed deep neural network and adaptive self-taught-based transfer learning (DST-TL) technique uses the concept of self-taught learning to develop an efficient and reliable IDS. Instead of training the model only on the original features, as in previously reported methods, the proposed DST-TL technique uses a self-taught learning approach to extract features through an already trained network (trained using unlabeled data). The extracted features are then combined with the original features and used to train a sparse auto-encoder. The organization of the paper is as follows: Sect. 2 describes the dataset, Sect. 3 presents the proposed technique, Sect. 4 discusses implementation and parameter details, Sect. 5 presents the results and discussion, and Sect. 6 concludes the paper.

2 Details of dataset

In 1998, MIT Lincoln Laboratory set up an environment in which raw TCP dump data were gathered through a local area network (LAN) that simulated a US Air Force LAN [22]. Each TCP connection within the network is labeled either as an attack or as normal. The training and test data contain five million and two million records, respectively. In 1999, a version of the data collected at MIT Lincoln Laboratory was used in the KDD contest on intrusion detection; the purpose of gathering the data was to support the analysis and evaluation of research on detecting intrusions within a network. The performance of the proposed DST-TL technique is evaluated on the NSL-KDD dataset (an updated version of the KDD dataset). The KDD dataset is among the few publicly available network-based IDS datasets, but it has some inherent deficiencies [23], some of which are:

  (a) Almost 78% and 75% of the samples are repeated in the training and test sets, respectively.

  (b) Redundant samples in the training set may bias the classifier toward the most frequent samples; as a result, good performance may not be achieved on the test data.

To overcome the deficiencies of the KDD dataset, the NSL-KDD dataset was introduced by Tavallaee et al. [23]. All redundant samples in the KDD dataset were removed, so that only a single copy of each record is kept in the training and test sets (see the sketch below). Moreover, a difficulty level was assigned to each record of the KDD data in both the training and test sets. For this purpose, the KDD training set was divided into three equal parts, seven different classifiers were trained on these portions, and as a result 21 predicted labels were obtained for each test record. After the removal of redundant records from the KDD data, the \(KDDTrain^{ + }\) and \(KDDTest^{ + }\) sets were generated. In addition to \(KDDTest^{ + }\), a new test set \(KDDTest^{ - 21}\) was generated containing only those records that were not correctly classified by all 21 trained classifiers. A 20% subset of \(KDDTrain^{ + }\) is also provided for experimental purposes. The performance of a classifier therefore needs to be evaluated on \(KDDTest^{ + }\) and, especially, on \(KDDTest^{ - 21}\). The percentages of positive and negative samples in \(KDDTest^{ + }\) and \(KDDTest^{ - 21}\) are shown in Table 1. In the NSL-KDD dataset, each sample consists of 41 features, which are categorized into three different groups.
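
The duplicate removal mentioned above, which produced \(KDDTrain^{ + }\) and \(KDDTest^{ + }\), amounts to keeping a single copy of each distinct record. The MATLAB sketch below only illustrates this operation; the file name is hypothetical, and this step is part of the NSL-KDD construction by Tavallaee et al. [23], not of the proposed method.

  % Keep a single copy of every distinct record, as done when building NSL-KDD
  T = readtable('kdd_train_raw.csv');      % hypothetical file name for the raw KDD records
  Tunique = unique(T, 'stable');           % removes duplicate rows, preserving the original order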

Table 1 Percentage of positive and negative samples in \(KDDTest^{ + }\) and \(KDDTest^{ - 21}\) dataset
  1. Basic features The TCP/IP connection-related attributes are included in the basic features. These features are derived from the packet header; out of the 41 features in the NSL-KDD dataset, ten are categorized as basic features, for example, protocol type, service, and flag.

  2. Content features Content features capture suspicious information within the payload of a TCP packet, for example, the number of failed login attempts.

  3. Traffic features Traffic features are calculated with reference to a time interval and are further categorized into two groups:

    (a) Same service features Features evaluated only over those connections in the past 2 s that use the same service as the current connection.

    (b) Same host features Features (behavior, services of the protocol, etc.) evaluated over those connections in the past 2 s that involve the same host as the current connection.

Table 2 shows details of the 41 features present in the NSL-KDD dataset.

Table 2 Types of features present within NSL-KDD dataset

All the attacks in the NSL-KDD data belong to the following four attack categories:

  • Denial-of-service (DOS) attacks

  • User-to-root (U2R) attacks

  • Remote-to-local (R2L) attacks

  • Probing (PROB) attacks

Details of all four attack categories are shown in Table 3; a minimal sketch that maps attack names to these categories is given after the table. Another property of the NSL-KDD dataset is that 14 new attack types are also included in the test data.

Table 3 Different types of attacks in NSL-KDD dataset
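
For a binary IDS, the specific attack name of each record is mapped either to the normal class or to one of the four categories above. The MATLAB sketch below illustrates such a mapping with a representative subset of KDD/NSL-KDD attack names; the variable names and the containers.Map-based lookup are only one possible implementation, not the paper's code.

  % Map specific attack names to the four categories (representative subset shown)
  attackCat = containers.Map();
  for a = {'neptune','smurf','back','teardrop','pod','land'},      attackCat(a{1}) = 'DOS';  end
  for a = {'ipsweep','portsweep','nmap','satan'},                  attackCat(a{1}) = 'PROB'; end
  for a = {'guess_passwd','ftp_write','imap','warezclient','phf'}, attackCat(a{1}) = 'R2L';  end
  for a = {'buffer_overflow','rootkit','loadmodule','perl'},       attackCat(a{1}) = 'U2R';  end

  % Binary label used for training: attack (1) versus normal (0)
  label    = 'neptune';                         % label field of one NSL-KDD record
  isAttack = ~strcmp(label, 'normal');          % true for any of the four attack categories
  if isAttack, category = attackCat(label); end % e.g., 'DOS' for 'neptune'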

3 Deep sparse auto-encoder

Sparsity reduces the number of active links within a network and thus generally improves the generalization performance of a machine learning technique. In the past, different researchers have proposed sparse representation-based systems for face recognition [24, 25], object categorization [26], classification [27], and regression-related [28, 29] tasks. Lu et al. [24] proposed a locality-based weighted sparse representation classifier (WSRC) for face recognition; its performance was evaluated on 15 different datasets from the UCI repository, and the experimental results show that the WSRC-based method outperforms most state-of-the-art techniques. Mi et al. [25] addressed the recognition of partially occluded faces: after dividing the images into blocks, a regression-based technique is used to identify the occluded blocks, and a sparse representation classification (SRC)-based strategy is then used to recognize the face. A comparison of Mi's technique against extended SRC (eSRC) shows that it is not only better in terms of accuracy but also computationally less expensive. In another work, Gui et al. [30] presented a survey of sparsity-based feature selection strategies, discussing their mathematical formulation and motivation. In addition, a group sparse multiview patch alignment framework (GSM-PAF) has been proposed for feature selection and extraction by introducing sparsity into the projection matrix [27]; the GSM-PAF technique shows good performance on real image classification tasks. In the proposed DST-TL work, a sparse auto-encoder-based self-taught learning approach is used to develop a reliable IDS.

An auto-encoder is a kind of neural network that tries to reproduce its input at its output. Two steps are involved in the training of a deep sparse auto-encoder. In the first step, unsupervised greedy layer-wise pre-training is performed, in which unlabeled data are used to extract useful features at each layer; in the second step, the stacked network is fine-tuned with a supervised learning algorithm (described later in this section). Encoding and decoding are the key operations in the pre-training of a deep sparse auto-encoder. Because the auto-encoder simply tries to reproduce its input at its output during pre-training, no labeled data are needed in this phase. Let \(d_{\text{inp}}\) be the input to the encoder, which is mapped by the encoding function \(func_{\text{enc}}\) as shown in Eq. 1.

$${\text{en}}_{\text{inp}} = func_{\text{enc}} (d_{\text{inp}} )$$
(1)

After encoding the input data, the next phase is to decode the encoded representation \(\left( {{\text{en}}_{\text{inp}} } \right)\) through the decoding function \(func_{\text{dec}}\), as shown in Eq. 2.

$${\text{dec}}_{\text{inp}} = func_{\text{dec}} ({\text{en}}_{\text{inp}} )$$
(2)

The purpose of decoding is to reconstruct the original input. If, during training, the auto-encoder merely learns to copy its input to its output, it provides little benefit and tends to overfit, so its results on unseen samples may be poor. To mitigate overfitting while training each individual auto-encoder, a sparsity term is introduced into the loss function, which encourages the auto-encoder to learn more generalized features during the pre-training phase.

A simple auto-encoder comprises an input layer, a hidden layer (which encodes the input data), and an output layer that reconstructs the original input. The loss function used during pre-training of the sparse auto-encoder is described below.

$${\text{Loss}}\,{\text{function}} = \frac{1}{S}\sum\limits_{i = 1}^{I} {\sum\limits_{j = 1}^{J} {(d_{{{\text{inp}}_{ij} }} - {\text{dec}}_{{{\text{inp}}_{ij} }} )^{2} } } + \lambda_{L} \varOmega_{\text{weight}} + \beta_{L} \varOmega_{\text{sparsity}}$$
(3)

\(\left( {\frac{1}{S}\sum\nolimits_{i = 1}^{I} {\sum\nolimits_{j = 1}^{J} {(d_{{{\text{inp}}_{ij} }} - {\text{dec}}_{{{\text{inp}}_{ij} }} )^{2} } } } \right)\) is the mean squared error between the decoded \(\left( {{\text{dec}}_{\text{inp}} } \right)\) and actual \(\left( {d_{\text{inp}} } \right)\) inputs, where \(S\) is the normalizing constant of the mean. \(\varOmega_{\text{weight}}\) is the weight regularization term with coefficient \(\lambda_{L}\), and \(\varOmega_{\text{sparsity}}\) is the sparsity regularization term with coefficient \(\beta_{L}\). The \(L_{2}\) weight regularization is expressed mathematically as:

$$\varOmega_{\text{weight}} = \frac{1}{2}\sum\limits_{l}^{L} {\sum\limits_{i}^{I} {\sum\limits_{j}^{J} {\left( {W_{ij}^{l} } \right)^{2} } } }.$$
(4)

In Eq. 4, \(L\) denotes the number of layers, and \(W_{ij}^{l}\) denotes the weight of the link between unit \(j\) and unit \(i\) in layer \(l\), so the sums run over all weights of the network. The \(L_{2}\) weight regularization term keeps the network weights small and thus acts as a regularizer that helps the network learn suitable weights during the training phase. The sparsity regularization term is expressed mathematically as:

$$\varOmega_{\text{sparsity}} = \sum\limits_{i' = 1}^{s'} {KL\left( {A'\,\Vert \,A'_{i'} } \right)} = \sum\limits_{i' = 1}^{s'} {\left[ {A'\log \left( {\frac{A'}{{A'_{i'} }}} \right) + (1 - A')\log \left( {\frac{1 - A'}{{1 - A'_{i'} }}} \right)} \right]}$$
(5)

Here, \(A'_{i'}\) is the average activation value of hidden neuron \(i'\) (computed over the training samples), \(A'\) is the desired activation value, which is the same for all hidden neurons, and \(s'\) is the number of hidden neurons. The activation value of a particular neuron is controlled by regulating the weights of the network. If the desired and average activation values of a neuron are equal, the sparsity regularization term is zero; it grows as the difference between \(A'_{i'}\) and \(A'\) increases.
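
As a small worked example of Eq. 5 (with assumed values, not taken from the experiments reported here): for a desired activation \(A' = 0.05\) and an average activation \(A'_{i'} = 0.2\), the contribution of neuron \(i'\) is

$$KL\left( {0.05\,\Vert \,0.2} \right) = 0.05\log \left( {\frac{0.05}{0.2}} \right) + 0.95\log \left( {\frac{0.95}{0.8}} \right) \approx - 0.069 + 0.163 \approx 0.094,$$

a small penalty that grows rapidly as \(A'_{i'}\) moves further away from \(A'\).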

To make the network deep, the decoders are removed after each auto-encoder has been trained individually, and the encoded output is provided as input to the next auto-encoder. When the required number of auto-encoders has been trained, they are stacked to form a feedforward network (with good initial weights), which is then fine-tuned using a backpropagation-based learning algorithm. In the proposed DST-TL technique, features are extracted using only the unsupervised pre-training phase (i.e., using unlabeled data). The sparsity in the deep sparse auto-encoder helps to extract an effective feature space and improves the generalization performance of the trained network. Moreover, the features extracted by the proposed DST-TL approach, when combined with the original features of the NSL-KDD dataset, generally lead to a more diverse feature space.
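
The greedy layer-wise procedure described above can be sketched with the MATLAB Neural Network Toolbox (MATLAB being the programming tool used in this work, Sect. 4). The hidden-layer sizes, training options, and the variable names X (features, stored column-wise) and T (one-hot labels) are illustrative placeholders, not the settings reported in Tables 4 and 5.

  % Greedy layer-wise pre-training of two sparse auto-encoders (illustrative settings)
  ae1 = trainAutoencoder(X, 30, ...
      'L2WeightRegularization', 0.004, ...   % lambda_L in Eq. 3
      'SparsityRegularization', 4, ...       % beta_L in Eq. 3
      'SparsityProportion', 0.05, ...        % desired activation A' in Eq. 5
      'MaxEpochs', 200);
  feat1 = encode(ae1, X);                    % keep the encoder, discard the decoder

  ae2 = trainAutoencoder(feat1, 15, ...
      'L2WeightRegularization', 0.004, ...
      'SparsityRegularization', 4, ...
      'SparsityProportion', 0.05, ...
      'MaxEpochs', 200);
  feat2 = encode(ae2, feat1);

  % Stack the trained encoders with a softmax layer and fine-tune with backpropagation
  softnet = trainSoftmaxLayer(feat2, T, 'MaxEpochs', 200);
  deepnet = stack(ae1, ae2, softnet);
  deepnet = train(deepnet, X, T);            % supervised fine-tuning of the stacked network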

3.1 Deep neural network and adaptive self-taught-based transfer learning (DST-TL) for IDS

The basic idea behind the proposed technique is to use unsupervised feature extraction based on adaptive self-taught learning to develop an efficient IDS. The results show that the proposed DST-TL technique provides good generalization in terms of several performance measures. A deep sparse auto-encoder is used in the proposed DST-TL technique to extract effective features through unsupervised learning. Figure 1 shows a block diagram of the proposed DST-TL methodology.

Fig. 1 Block diagram of the proposed DST-TL methodology

3.2 Importance of transfer learning in the proposed DST-TL technique

Leveraging the knowledge gained from one type of machine learning problem and applying it to another machine learning task is known as transfer learning. The domain from which the knowledge is extracted is called the source domain; the extracted knowledge takes the form of either features or learned weights. The domain in which the knowledge gained from the source domain is applied is known as the target domain. Transfer learning is particularly helpful for deep neural networks, whose training requires considerable calibration effort. Several reasons motivate machine learning practitioners to apply transfer learning, some of which are listed below:

  • Sometimes the labeled data in the target domain are not sufficient to train a network.

  • Considerable effort (for parameter tuning) is required in the target domain to achieve good results.

  • Training from scratch takes a lot of time.

In transfer learning, the target and source domains do not necessarily follow the same distribution, and in some forms of transfer learning the target and source tasks do not even come from similar domains. Some common transfer learning approaches are listed below:

  • Self-taught learning

  • Multitask learning

  • Domain adaptation

  • Unsupervised transfer learning

In the past, various transfer learning-based approaches [31,32,33,34,35,36,37,38,39,40,41,42] have been reported for a range of machine learning tasks. In the proposed DST-TL technique, a robust, self-taught learning-based IDS is developed.

3.2.1 Self-taught learning

Typically, in transfer learning, the target and source domains follow the same data distribution, and a sufficient amount of data is available in the source domain compared with the target domain. In 2007, Raina et al. [31] presented the idea of self-taught learning, in which unlabeled data in the source domain help to improve the performance of a classification task in the target domain; moreover, the target and source domain data need not follow the same distribution. Features extracted using unlabeled data from the source domain may thus assist learning in the target domain task.

3.2.2 Exploiting adaptive self-taught learning using deep sparse auto-encoder

The proposed DST-TL technique comprises two phases. In phase 1, a deep sparse auto-encoder pre-trained (using unlabeled data) on a regression-related task (wind power prediction in this case) serves as the transferable knowledge from the source domain. Only the first and last layers (pre-trained on the target domain data) are added to the already trained hidden layers (from the source domain) to form a feedforward neural network. Feature extraction in phase 1 is then performed simply by passing the target domain data through this trained feedforward network.

In phase 2, the original features along with the extracted features (from phase 1) are provided as input to train a sparse auto-encoder. The sparse auto-encoder is trained (using pre-training) such that only those features are extracted from the input data on which the softmax classifier shows good performance (on validation data). The performance of the trained network is then evaluated on the \(KDDTest^{ + }\) and \(KDDTest^{ - 21}\) test sets. Figure 2 shows the self-taught learning-based feature extraction methodology used in the proposed DST-TL technique.
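
A minimal MATLAB sketch of the two phases follows. It assumes that the source-domain encoders pre-trained on the wind power data are available as the auto-encoder objects srcAE1 and srcAE2; these names, the hidden-layer sizes, and the training options are hypothetical and illustrate only one plausible reading of Fig. 2, not the exact implementation.

  % Phase 1: pass the NSL-KDD data through the source-domain encoders to extract features
  % Xtrain, Xtest: original NSL-KDD features (columns = samples); Ytrain: one-hot labels
  Ztrain = encode(srcAE2, encode(srcAE1, Xtrain));    % transferred feature representation
  Ztest  = encode(srcAE2, encode(srcAE1, Xtest));

  % Phase 2: combine original and transferred features and train a sparse auto-encoder
  Ctrain = [Xtrain; Ztrain];                          % combined (more diverse) feature space
  Ctest  = [Xtest;  Ztest];

  ae = trainAutoencoder(Ctrain, 25, ...               % illustrative hidden size
      'L2WeightRegularization', 0.004, ...
      'SparsityRegularization', 4, ...
      'SparsityProportion', 0.05);
  Ftrain = encode(ae, Ctrain);
  Ftest  = encode(ae, Ctest);

  % Softmax classifier on the learned features; evaluate on KDDTest+ and KDDTest-21
  softnet = trainSoftmaxLayer(Ftrain, Ytrain);
  scores  = softnet(Ftest);                           % class posterior estimates for test data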

Fig. 2 Use of self-taught learning in the training of the deep sparse auto-encoder

3.2.3 Knowledge transfer from source to target domains

In self-taught learning, although the source task does not follow the same data distribution as the target domain, the two domains are still related in some way. In the proposed DST-TL technique, the source domain data were collected from wind farms in Europe, and the task was to forecast the generated power from the provided features. Since the power output at time t depends on the climatic conditions of the preceding hours, the predicted power and the associated features of the last 24 h are used as input to predict the power at time t; in short, the source domain dataset is of a time series nature. Moreover, wind power prediction is a challenging task because sudden fluctuations in geographical and climatic conditions can affect the generated power. The target domain task in the proposed DST-TL technique, on the other hand, is intrusion detection, with NSL-KDD as the benchmark dataset. The discussion of the NSL-KDD dataset in Sect. 2 shows that the traffic features are also of a time series nature, since they are calculated from the connections in the past 2 s. When an intruder attacks the network, the behavior of the network flow changes suddenly, so any sudden change in network behavior may be a sign of intrusion. Network intrusion, just like wind power, is unpredictable in nature. Thus, the source domain task is related to the target domain task in terms of both the time series nature of the input features and the unpredictable behavior of power and intrusions.
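
One plausible way to form the lagged source-domain inputs described above is sketched below; the variable names, the hourly resolution, and the exact windowing are assumptions made for illustration, since the paper does not give the preprocessing code.

  % Build source-domain inputs from the previous 24 h of features and power values
  % feat: H-by-F matrix of hourly features, pwr: H-by-1 vector of hourly power values
  lag = 24;
  H = size(feat, 1);
  X = zeros(H - lag, lag * size(feat, 2) + lag);   % inputs: features and power of the last 24 h
  y = zeros(H - lag, 1);                           % target: power at time t
  for t = lag + 1:H
      X(t - lag, :) = [reshape(feat(t-lag:t-1, :)', 1, []), pwr(t-lag:t-1)'];
      y(t - lag)    = pwr(t);
  end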

In the proposed DST-TL technique, because of this commonality between the target and source domains, the network trained in the source domain is able to predict the wind power despite abrupt changes in atmospheric behavior and can thus transfer the gained knowledge to the target domain.

4 Implementation details

All the experimental work related to the proposed technique was performed on a desktop computer with 16 GB RAM, an Intel(R) Core(TM) i7-3770 CPU @ 3.4 GHz, and a 64-bit operating system. The operating system was Windows 7 Professional, and MATLAB 2016a was used as the programming tool.

4.1 Parameter setting of the proposed technique

In order to set the parameters of the deep sparse auto-encoder, 10% of the training data were used to validate the parameter choices. Table 4 shows the parameter settings of the deep sparse auto-encoder whose hidden layers are pre-trained on the source domain task. Table 5 lists the parameter settings used while training the sparse auto-encoder with (using the original features as well as the features extracted from the source domain) and without (using only the original NSL-KDD features) the self-taught learning approach.
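
A hold-out split of this kind can be sketched as follows; the paper does not specify how the 10% validation subset was drawn, so a random split and the variable names Xtrain/Ytrain are assumed.

  % Hold out 10% of the training samples for validating the parameter choices
  n    = size(Xtrain, 2);                      % samples stored column-wise
  idx  = randperm(n);
  nVal = round(0.1 * n);
  Xval = Xtrain(:, idx(1:nVal));      Yval = Ytrain(:, idx(1:nVal));
  Xtr  = Xtrain(:, idx(nVal+1:end));  Ytr  = Ytrain(:, idx(nVal+1:end));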

Table 4 Parameter setting of deep sparse auto-encoder (for extracting features) using hidden layers trained on source domain task
Table 5 Parameter setting of the sparse auto-encoder trained on the combined original and extracted features

4.2 Performance evaluation

To evaluate the proposed DST-TL technique, detection rate, false alarm rate, area under the ROC curve (AUC-ROC), area under the precision–recall curve (AUC-PR), and accuracy are used as evaluation measures. The ROC curve plots sensitivity against 1 − specificity at different threshold values, whereas the PR curve plots precision against recall. Mathematically, the measures are defined in Eqs. 6–11.

$${\text{Sensitivity}} = {\text{Recall}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}$$
(6)
$$1 - {\text{Specificity}} = \frac{\text{FP}}{{{\text{FP}} + {\text{TN}}}}$$
(7)
$${\text{Precision}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}$$
(8)
$${\text{Detection}}\,{\text{Rate}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}$$
(9)
$${\text{False}}\,{\text{Alarm}}\,{\text{Rate}} = \frac{\text{FP}}{{{\text{TN}} + {\text{FP}}}}$$
(10)
$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}} \times 100$$
(11)

In the above equations, TP and TN denote the numbers of positive and negative class samples, respectively, that are correctly classified by the classifier, whereas FP and FN denote the numbers of negative and positive class samples, respectively, that are classified incorrectly.
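
The measures in Eqs. 6–11 can be computed directly from the confusion matrix. The MATLAB sketch below does so and also obtains AUC-ROC and AUC-PR with perfcurve (Statistics and Machine Learning Toolbox); the variable names and the use of perfcurve are illustrative assumptions, not taken from the paper's implementation.

  % yTrue: true binary labels (1 = attack, 0 = normal); yPred: predicted labels;
  % scores: classifier scores for the attack class
  C  = confusionmat(yTrue, yPred, 'Order', [1 0]);   % rows/cols ordered as [attack, normal]
  TP = C(1,1); FN = C(1,2); FP = C(2,1); TN = C(2,2);

  sensitivity   = TP / (TP + FN);                    % Eq. 6 (recall)
  fpr           = FP / (FP + TN);                    % Eq. 7 (1 - specificity)
  precision     = TP / (TP + FP);                    % Eq. 8
  detectionRate = TP / (TP + FP);                    % Eq. 9, as defined in the paper
  falseAlarm    = FP / (TN + FP);                    % Eq. 10
  accuracy      = (TP + TN) / (TP + FP + TN + FN) * 100;   % Eq. 11

  [~, ~, ~, aucROC] = perfcurve(yTrue, scores, 1);   % area under the ROC curve
  [~, ~, ~, aucPR]  = perfcurve(yTrue, scores, 1, 'XCrit', 'reca', 'YCrit', 'prec');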

5 Results and discussion

In the proposed DST-TL technique, after the training of the sparse auto-encoder, the performance of the trained model is evaluated on the \(KDDTest^{ + }\) and \(KDDTest^{ - 21}\) datasets. To show the stability of the proposed DST-TL technique, the performance over ten independent runs is reported in terms of detection and false alarm rates, AUC-ROC and AUC-PR, and accuracy in Tables 8, 9, and 10, respectively. Before examining the performance of the proposed DST-TL technique, the performance of conventional classifiers [multilayer perceptron (MLP), nonlinear principal component analysis (NLPCA), and deep belief network (DBN)] is evaluated on the \(KDDTest^{ + }\) and \(KDDTest^{ - 21}\) datasets. MLP is a commonly used classifier that is trained with the backpropagation learning algorithm. DBN is a classifier formed by stacking independently trained restricted Boltzmann machines (RBMs); after stacking, the backpropagation learning algorithm is used to fine-tune the resulting feedforward network. Principal component analysis (PCA) is a commonly used feature reduction technique, and NLPCA can be considered a generalized (nonlinear) form of PCA. In this case, NLPCA comprises a simple auto-encoder whose encoding layer contains fewer neurons than the input layer; the features extracted from the hidden layer are provided as input to a simple classifier for classifying the input data. Table 6 shows the performance of MLP, DBN, and NLPCA in terms of all the evaluation measures, and the parameters used during training are provided in Table 7.

Table 6 Performance comparison of MLP, NLPCA, and DBN in terms of AUC-ROC, AUC-PR, and detection and false alarm rates
Table 7 Parameter setting of MLP, NLPCA, and DBN during training

5.1 Performance evaluation using detection and false alarm rates

Table 8 compares the detection and false alarm rates of the sparse auto-encoder trained with and without self-taught learning; the standard deviation of the error over the ten independent runs is also shown. Table 8 illustrates that the deep sparse auto-encoder trained with the self-taught learning approach gives better and more stable results (evident from the low standard deviation of error) than the deep sparse auto-encoder trained without it.

Table 8 Performance comparison of the trained sparse auto-encoders for ten independent runs

5.2 Performance evaluation using AUC-ROC, AUC-PR, and accuracy

Table 9 shows the performance of the proposed DST-TL technique on the \(KDDTest^{ + }\) and \(KDDTest^{ - 21}\) datasets in terms of AUC-ROC and AUC-PR. It can be observed that the self-taught learning approach increases the stability and performance of the network on unseen data and yields a low standard deviation of error. Figures 3 and 4 graphically illustrate the AUC-ROC and AUC-PR on \(KDDTest^{ + }\) and \(KDDTest^{ - 21}\) for the ten independent runs.

Table 9 Performance comparison of trained sparse auto-encoders on test data
Fig. 3 AUC-ROC of the proposed DST-TL technique for ten independent runs, a on \(KDDTest^{ + }\) and b on \(KDDTest^{ - 21}\)

Fig. 4 AUC-PR of ten independent runs, a on \(KDDTest^{ + }\) and b on \(KDDTest^{ - 21}\)

Table 10 shows the accuracy of the proposed technique for ten independent runs. It is observed that the features extracted through the self-taught learning approach lead to better training of the proposed DST-TL method. The self-taught learning approach helps in extracting robust features, which the deep sparse auto-encoder then uses to learn suitable weights during training, thereby increasing the generalization performance.

Table 10 Performance comparison for ten independent runs of the trained sparse auto-encoders in terms of accuracy

5.3 Performance comparison of the proposed DST-TL technique with state-of-the-art techniques

Table 11 shows a comparison of the proposed DST-TL method with different classifiers (J48, naïve Bayes, NB tree, random forest, random tree, MLP, NLPCA, DBN, and SVM) and with a fuzzy-based semi-supervised technique. Table 11 shows that the performance of the proposed DST-TL technique is better than that of the existing methods, especially on the \(KDDTest^{ - 21}\) dataset.

Table 11 Performance comparison of the proposed technique with the existing methods

6 Conclusion

A novel network IDS based on a deep sparse auto-encoder that exploits self-taught learning is proposed. In the first phase, feature extraction is performed by passing the original NSL-KDD feature set through the pre-trained network. A combination of the original and extracted features is then used to train a sparse auto-encoder. The combined features improve the effectiveness of the feature space and thus increase the performance of the sparse auto-encoder on test samples. It is experimentally shown that the proposed DST-TL technique yields improved performance, even though the adapted network was trained on a regression-related task. The experimental comparison shows that the sparse auto-encoder trained on the improved feature space extracted through self-taught learning is more robust and stable than the sparse auto-encoder trained on the original feature space. The performance on the test data shows that, in comparison with previous techniques, the proposed DST-TL approach is robust and provides improved prediction accuracy. In the future, we intend to apply deep neural networks, especially recent architectures of deep convolutional neural networks, for classifying the different types of attacks [43].