
1 Introduction

Due to recent advances in computer networks and data storage, data are produced and accumulated at an ever-increasing rate in the form of streams, such as online shopping records, logistics information, online news, stock market data, emails, and credit card transactions. These data arrive in real time, continuously and in order, and need to be analyzed promptly and effectively. For example, in an online mail system, incoming emails need to be classified into different categories, such as spam, business emails, personal emails, and important emails. This classification task, in which each stream example is associated with a single class label l from a set of labels \(L\left( |L|>1\right) \), is called single-label data stream classification and has been extensively studied [3, 4, 11, 20, 33] in the literature.

In many emerging applications, each stream record may carry multiple class labels. A good example is news reports in online news systems: most news reports cover multiple topics (e.g., entertainment, finance and politics). This setting is called Multi-Label data Stream Classification (MLSC) [21]. Formally, the multi-label stream classification problem is to train a model that attaches a label subset \(Y\subseteq L\) to each instance in a high-speed data stream. Although multi-label classification has been studied in traditional database mining scenarios, multi-label data stream classification is a relatively new concept and has not been fully addressed yet.

An intuitive approach to solving the multi-label classification problem is to transform it into one or more single-label classification problems. In this fashion, traditional single-label classifiers can be employed to make single-label predictions, and the multi-label prediction is then produced by combining these single-label predictions. Alternatively, the second category of methods adapts existing single-label classifiers directly to the multi-label setting [18, 23].

The multi-label data stream environment poses challenges different from the traditional batch learning setting. As the instances in a multi-label data stream carry multiple labels (multiple concepts), dealing with concept drift is the most important challenge for a classifier. Another challenge of MLSC is that an arriving example may belong to a set of labels some of which have not been observed before, because of the dynamic nature of the label set [21]. Besides, the learner must be able to handle the stream in real time using limited memory, because stream data flood in continuously at high speed, which makes it impossible to store them in memory and process them offline [14].

To address the above issues, in this paper we propose an efficient and effective ML-KNN-based ensemble model for multi-label stream classification with a balanced AdjustWeight function, called Streaming Weighted ML-KNN-based Ensemble Classifier (SWMEC). More specifically, we first propose an ML-KNN-based algorithm to build the basic classifier \(C_{i}\). The ensemble \(C=<C_{1},C_{2},\cdots ,C_{L}>\) is built at the beginning of the stream from L randomly selected data chunks. Building a classifier only requires computing and saving a small amount of information about the cluster center points, so the building process is highly efficient while consuming constant memory space. In addition, each classifier \(C_{i}\) has its own weight \(w_{i}\), giving \(C=<\left( C_{1},w_{1}\right) ,\left( C_{2},w_{2}\right) ,\cdots ,\left( C_{L},w_{L}\right)>\). As data flow in, the weights are adjusted and the ensemble C is updated according to the weights. Thus the proposed SWMEC approach can adapt to evolving data, deal with concept drifts, and efficiently classify the incoming data in real time.

The main contributions of this paper are as follows: (1) an adaptation of existing multi-label methods to evolving data streams; (2) an effective weight-adjustment strategy for ensemble classifiers; (3) experimental results comparing our method with benchmarks in terms of predictive performance and space complexity.

The remainder of this paper is organized as follows. In Sect. 2, we discuss related work on multi-label classification and stream classification. Section 3 presents the preliminaries of the problem. Section 4 describes the proposed framework in detail, and the experimental results are presented in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related Work

Our work is related to multi-label classification and stream classification techniques. We will briefly review the existing work on both of them.

Multi-label classification deals with instances that may belong to multiple different classes simultaneously and focuses on offline settings [16, 19]. Multi-label classification methods can be grouped into two categories, namely problem transformation and algorithm adaptation [21]. Problem transformation methods transform the multi-label problem into multiple single-label problems; representatives include Label Power-set (LP), Binary Relevance (BR) and Ranking by Pairwise Comparison (RPC) [12]. Algorithm adaptation methods extend traditional learning techniques to the multi-label context, such as decision trees [5], neural networks [30], maximal margin methods [10], maximum entropy methods [25], and ensemble methods [25]. One well-known approach of this kind is ML-KNN [31], which is derived from the popular lazy learning algorithm kNN. It is the approach most relevant to our model and will be introduced in the next section.

Many studies have also been done on single-label stream classification. There are two families of solutions: single-model based and ensemble based. Single-model based approaches [1, 2, 6, 9, 26, 27, 32, 33] use new data to incrementally update their model so that the model can scale to a large data volume. Ensemble based approaches [13, 22, 28], on the other hand, partition the data stream into equal-sized chunks and train multiple base models on different chunks. All the models are then combined for prediction. Ensemble based approaches are easier to scale and parallelize, tend to achieve better accuracy, and are less prone to overfitting than single-classifier methods.

Recently, there have also been studies focusing on multi-label stream classification [14, 15, 17, 18, 21, 29]. Kong et al. [14] build an ensemble of classifiers on successive data chunks and propose a random-tree-based algorithm to improve its efficiency. Further work has adopted the ensemble based strategy for handling multi-label streams [15, 29].

We follow a similar strategy to design our classification model with ML-KNN-based ensemble methods. Our model builds streaming classifiers by extending ML-KNN, which was originally designed for static multi-label data classification.

3 Preliminaries

We first introduce the notation that will be used throughout this paper, and then briefly describe the ML-KNN technique [31] to make this paper self-contained.

Consider a multi-label stream S with a label set \(\mathcal {L}=\left\{ 1,2,\cdots ,q\right\} \), \(S\subseteq \mathbb {R}^{d}\). Stream S consists of an infinite sequence of data chunks \(\left\{ D_{1},\cdots ,D_{n},\cdots \right\} \), where a labeled chunk is denoted by \(D_{L}\) and an unlabeled chunk by \(D_{U}\). Each instance \(x\in S\) has a label subset represented by a vector \(Y=\left\{ y_{1},y_{2},\cdots ,y_{q}\right\} \in \left\{ -1,1\right\} ^{q}\), where \(Y[j]=1\) if the instance carries label j and \(Y[j]=-1\) otherwise, and \(\left( x_{i},Y_{i}\right) \) is an instance in the multi-label data stream. The task of ensemble-based multi-label stream classification is to train a classification model \(F\left( \cdot \right) \) on the historical examples, with \(F\left( \cdot \right) =g\left( f_{1}\left( \cdot \right) ,f_{2}\left( \cdot \right) ,\cdots ,f_{L}\left( \cdot \right) \right) \), where \(f_{i}(\cdot )\) is a sub-classifier, \(g\left( \cdot \right) \) is the combination function that combines the outputs of all \(f_{i}\left( \cdot \right) \), and L is the number of sub-classifiers. \(F\left( \cdot \right) \) is then used to predict a label set \(Y_{i}\) for each incoming instance \(x_{i}\).

Table 1. Summary of major mathematical notations

3.1 ML-KNN [31]

We briefly describe our model's basic approach, ML-KNN, which is derived from the traditional k-Nearest Neighbor (kNN) algorithm and classifies static multi-label data in a lazy learning way. The approach has three main steps. For convenience, several notations are summarized in Table 1. In addition, we are given a training set \(T=\left\{ \left( x_{1},Y_{1}\right) ,\left( x_{2},Y_{2}\right) ,\cdots ,\left( x_{n},Y_{n}\right) \right\} \left( x_{i}\in \mathbb {R}^{d},Y_{i}\subseteq \mathcal {L}\right) \), a test instance t, and a smoothing parameter s with a default value of 1.

Step 1: Compute the prior probabilities \(P\left( H_{b}^{l}\right) \) according to Eqs. 1 and 2, where \(l\in \mathcal{L}\) is the l-th label and \(b\in \left\{ 0,1\right\} \). \(H_{1}^{l}\) represents the event that the instance has label l; conversely, \(H_{0}^{l}\) means the instance does not have label l.

$$\begin{aligned} P\left( H_{1}^{l}\right) =\left( s+{\displaystyle \sum _{i=1}^{n}L_{t}\left( l\right) }\right) /\left( s\times 2+n\right) \end{aligned}$$
(1)
$$\begin{aligned} P\left( H_{0}^{l}\right) =1-P\left( H_{1}^{l}\right) \end{aligned}$$
(2)
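To make Step 1 concrete, the prior estimation can be sketched in a few lines of Python (a minimal sketch assuming numpy and a binary label matrix; the function name and data layout are our own, not the paper's implementation):

import numpy as np

def mlknn_priors(Y, s=1.0):
    # Y: (n, q) binary matrix, Y[i, l] = 1 iff training instance i carries label l
    n, q = Y.shape
    p_h1 = (s + Y.sum(axis=0)) / (s * 2 + n)   # Eq. (1): one prior per label
    p_h0 = 1.0 - p_h1                          # Eq. (2)
    return p_h1, p_h0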

Step 2: Compute the posterior probabilities \(P\left( E_{j}^{l}\mid H_{b}^{l}\right) \) according to Eqs. 3 and 4, where \(E_{j}^{l}\left( j\in \left\{ 0,1,\cdots ,k\right\} \right) \) denotes the event that, among the k nearest neighbors of t, exactly j instances have label l. c[j] counts the number of training instances with label l whose k nearest neighbors contain exactly j instances with label l; correspondingly, \(c^{'}[j]\) counts the number of training instances without label l whose k nearest neighbors contain exactly j instances with label l.

$$\begin{aligned} P\left( E_{j}^{l}\mid H_{1}^{l}\right) =\left( s+c\left[ j\right] \right) /\left( s\times \left( k+1\right) +\sum _{p=0}^{k}c\left[ p\right] \right) \end{aligned}$$
(3)
$$\begin{aligned} P\left( E_{j}^{l}\mid H_{0}^{l}\right) =\left( s+c^{'}\left[ j\right] \right) /\left( s\times \left( k+1\right) +\sum _{p=0}^{k}c^{'}\left[ p\right] \right) \end{aligned}$$
(4)
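Step 2 can be sketched analogously (a minimal sketch assuming Euclidean distance, the same binary label matrix Y as above, and a numpy feature matrix X; helper names are illustrative):

import numpy as np

def mlknn_posteriors(X, Y, k=4, s=1.0):
    # X: (n, d) features; Y: (n, q) binary label matrix
    n, q = Y.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # an instance is not its own neighbor
    knn = np.argsort(dist, axis=1)[:, :k]          # indices of the k nearest neighbors
    c = np.zeros((q, k + 1))                       # c[l, j]
    c_prime = np.zeros((q, k + 1))                 # c'[l, j]
    for i in range(n):
        delta = Y[knn[i]].sum(axis=0)              # how many neighbors of i carry each label
        for l in range(q):
            j = int(delta[l])
            if Y[i, l] == 1:
                c[l, j] += 1
            else:
                c_prime[l, j] += 1
    p_e_h1 = (s + c) / (s * (k + 1) + c.sum(axis=1, keepdims=True))              # Eq. (3)
    p_e_h0 = (s + c_prime) / (s * (k + 1) + c_prime.sum(axis=1, keepdims=True))  # Eq. (4)
    return p_e_h1, p_e_h0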

Step 3: Compute the output \(y_{t}\) (label subset) and \(r_{t}\) of t according to Eqs. 5 and 6, where \(r_{t}\) is a real-valued vector used to rank the labels in \(\mathcal {L}\) and \(C_{t}\left( l\right) \) denotes the number of t's k nearest neighbors that carry label l.

$$\begin{aligned} \overset{\rightarrow }{y_{t}}\left( l\right) =\arg \max _{b\in \left\{ 0,1\right\} }P\left( H_{b}^{l}\right) P\left( E_{C_{t}\left( l\right) }^{l}\mid H_{b}^{l}\right) \end{aligned}$$
(5)
$$\begin{aligned} \overset{\rightarrow }{r_{t}}\left( l\right) =P\left( H_{1}^{l}\mid E_{C_{t}\left( l\right) }^{l}\right) =\left( P\left( H_{1}^{l}\right) P\left( E_{C_{t}\left( l\right) }^{l}\mid H_{1}^{l}\right) \right) /P\left( E_{C_{t}\left( l\right) }^{l}\right) \end{aligned}$$
(6)

The detailed architecture of ML-KNN was given in [31].
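Step 3 then combines the estimated statistics for a test instance; a minimal sketch reusing the two helpers above (the function name and the use of Euclidean distance are our assumptions):

import numpy as np

def mlknn_predict(x_t, X, Y, p_h1, p_h0, p_e_h1, p_e_h0, k=4):
    # count, per label, how many of the k nearest training neighbors of x_t carry it
    knn = np.argsort(np.linalg.norm(X - x_t, axis=1))[:k]
    c_t = Y[knn].sum(axis=0).astype(int)           # C_t(l) for each label l
    q = Y.shape[1]
    y_t = np.empty(q, dtype=int)
    r_t = np.empty(q)
    for l in range(q):
        j = c_t[l]
        score1 = p_h1[l] * p_e_h1[l, j]
        score0 = p_h0[l] * p_e_h0[l, j]
        y_t[l] = 1 if score1 > score0 else -1      # Eq. (5)
        r_t[l] = score1 / (score1 + score0)        # Eq. (6), expanding P(E) by total probability
    return y_t, r_t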

4 Weighted Ensemble Classification

In this section we first present the main idea of our weighted ensemble classification approach, then introduce how the ensemble classifiers are trained and updated and how their weights are adjusted, and finally describe our algorithm.

4.1 Basic Idea

The data stream S is divided into equal-sized chunks and each classification model in the ensemble is trained from a different chunk. Each classifier in the ensemble has its own weight. Every newly arriving unlabeled data chunk is classified by the ensemble, and the weights are updated accordingly. Based on the weights, we decide whether the most recently classified chunk should be used to train a new model that replaces one of the existing models in the ensemble. In this way the ensemble is maintained at a fixed size and kept up to date, so the infinite length of the data stream and concept drift can both be addressed.

4.2 Classifier Training and Updating

The ML-KNN based multi-label ensemble classifier is built at the beginning of the data stream and updated continuously as the stream evolves, as follows. When a training data chunk arrives at time t, we build h clusters from its labeled data points by applying the X-Means technique. After building these clusters, we save each cluster's centroid \(O=\{o_{1},o_{2},\cdots ,o_{h}\},o_{i}\in \mathbb {R}^{d},i\in \left\{ 1,2,\cdots ,h\right\} \), and compute each centroid's label subset in the same way as ML-KNN. All centroids with their label subsets, \(C_{i}=<\left( o_{1},y_{1},r_{1}\right) ,\left( o_{2},y_{2},r_{2}\right) ,\cdots ,\left( o_{h},y_{h},r_{h}\right) >\), where \(y_{j}\) and \(r_{j}\) are respectively the label subset and the real-valued ranking vector computed by ML-KNN, are then saved as a summary, together with the summary's arrival time t. Each summary's weight w can be calculated from o, \(\overset{\rightarrow }{r}\) and t, and is set to 1 at the beginning. After L chunks have been processed, the ensemble classifier \(C=<\left( C_{1},w_{1}\right) ,\left( C_{2},w_{2}\right) ,\cdots ,\left( C_{L},w_{L}\right)>\) has been built, where each model \(C_{i}\) is a collection of h summaries (h is not fixed) and \(w_{i}\) is the weight of model \(C_{i}\). When an unlabeled data chunk \(D_{U}\) arrives, for each instance \(x_{i}\in D_{U}\) and each sub-classifier \(C_{i}\) we find the summary \(m_{i}=\left( \left( o_{j},y_{j}\right) ,w_{i}\right) \in C_{i},j\in [1,2,\cdots ,h]\) whose centroid \(o_{j}\) is nearest to \(x_{i}\). The corresponding weight \(w_{i}\) is then determined by the distance between \(o_{j}\) and \(x_{i}\), \(\overset{\rightarrow }{r_{i}}\) and \(t_{i}\), and \(x_{i}\) is labeled according to the summaries \(\left\{ m_{1},m_{2},\cdots ,m_{L}\right\} \). Each unlabeled data chunk is classified in this way. After a chunk \(D_{U}\) has been handled by the ensemble C, if the lowest weight \(w_{lowest}\in w\) falls below a threshold \(\epsilon \), the corresponding model \(C_{j}\) is replaced by a new model trained on \(D_{U}\). This ensures that the ensemble is updated as data chunks pass by while the number of models in the ensemble remains constant.
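The per-chunk training step described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: we substitute scikit-learn's KMeans for X-Means for simplicity, reuse the ML-KNN helpers sketched in Sect. 3.1 to label the centroids, and the function name and summary layout are our own assumptions:

import time
import numpy as np
from sklearn.cluster import KMeans

def train_subclassifier(X_chunk, Y_chunk, h=10, k=4, s=1.0):
    # Cluster the labeled chunk and keep only the centroids as the model summary.
    km = KMeans(n_clusters=h).fit(X_chunk)
    p_h1, p_h0 = mlknn_priors(Y_chunk, s)
    p_e_h1, p_e_h0 = mlknn_posteriors(X_chunk, Y_chunk, k, s)
    summaries = []
    for o in km.cluster_centers_:
        # label each centroid in the same way ML-KNN labels a test instance
        y_o, r_o = mlknn_predict(o, X_chunk, Y_chunk, p_h1, p_h0, p_e_h1, p_e_h0, k)
        summaries.append((o, y_o, r_o))
    return {'summaries': summaries, 'time': time.time(), 'weight': 1.0}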

4.3 Ensemble Weighting

In this paper, inspired by [8], we propose a combination function \(g\left( \cdot \right) \) with three components (\(\overset{\rightarrow }{\alpha _{i}}\), \(\beta _{i}\) and \(\gamma _{i}\)) to combine the classifiers in the ensemble: (1) label confidence, a vector measuring the confidence of the sub-classifier in each label it outputs; (2) time difference; and (3) distance difference. Components (2) and (3) describe how confident a sub-classifier is when making a classification decision.

We estimate \(\overset{\rightarrow }{\alpha _{i}}\) from the ensemble model C by Eq. 7:

$$\begin{aligned} \overset{\rightarrow }{\alpha _{i}}=\frac{\overset{\rightarrow }{r_{i}}}{\sum _{i=1}^{L}\overset{\rightarrow }{r_{i}}} \end{aligned}$$
(7)

\(\beta _{i}\) is estimated from the time difference between the arriving data chunk \(D_{U}\) and the sub-classifier \(C_{i}\): \(\Delta t_{i}=t_{D_{U}}-t_{C_{i}}\), where \(t_{D_{U}}\) and \(t_{C_{i}}\) are the arrival times of the data chunk \(D_{U}\) and of the chunk that trained \(C_{i}\), respectively. The further apart the chunks are in time, the lower \(\beta _{i}\) is.

$$\begin{aligned} \beta _{i}=\frac{e^{-\Delta t_{i}}}{\sum _{i=1}^{L}e^{-\Delta t_{i}}};\quad 0<\beta _{i}<1,\ \sum _{i=1}^{L}\beta _{i}=1 \end{aligned}$$
(8)

\(\gamma _{i}\) is estimated from the distance between an instance \(x_{i}\) in the arriving data chunk \(D_{U}\) and its nearest center \(o_{j}\) in sub-classifier \(C_{i}\).

$$\begin{aligned} \gamma _{i}=\frac{e^{-d_{i}}}{\sum _{i=1}^{L}e^{-d_{i}}};0<\gamma _{i}<1;\sum _{i=1}^{L}\gamma _{i}=1 \end{aligned}$$
(9)

where \(d_{i}=distance\left( x_{i},o_{j}\right) \); the closer \(x_{i}\) is to \(o_{j}\), the more confident the sub-classifier.

Finally, we combine \(\overset{\rightarrow }{\alpha _{i}}\), \(\beta _{i}\) and \(\gamma _{i}\) to decide the weight \(w_{i}=\overset{\rightarrow }{\alpha _{i}}\times \beta _{i}\times \gamma _{i}\). Consequently:

$$\begin{aligned} g\left( \cdot \right) =\sum _{i=1}^{L}\left( f_{i}\left( x_{i}\right) \times \overset{\rightarrow }{\alpha _{i}}\times \beta _{i}\times \gamma _{i}\right) \end{aligned}$$
(10)

The output is \(Y=\left\{ y_{1},y_{2},\cdots ,y_{q}\right\} \in \left\{ -1,1\right\} ^{q}\), where

\(Y\left[ j\right] =\begin{cases} 1 & \text{if } g\left( \cdot \right) [j]=\sum _{i=1}^{L}\left( f_{i}\left( \cdot \right) [j]\times \overset{\rightarrow }{\alpha _{i}}[j]\times \beta _{i}\times \gamma _{i}\right) >0\\ -1 & \text{if } g\left( \cdot \right) [j]=\sum _{i=1}^{L}\left( f_{i}\left( \cdot \right) [j]\times \overset{\rightarrow }{\alpha _{i}}[j]\times \beta _{i}\times \gamma _{i}\right) <0 \end{cases}\).
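A minimal sketch of the weighting and combination step, assuming each sub-classifier is the summary dictionary produced by train_subclassifier above (all names are illustrative):

import numpy as np

def ensemble_predict(x, models, t_now):
    # For each sub-classifier, pick the nearest centroid summary and collect its vote.
    votes, r_vecs, dts, dists = [], [], [], []
    for m in models:
        cents = np.array([o for o, _, _ in m['summaries']])
        j = int(np.argmin(np.linalg.norm(cents - x, axis=1)))
        o_j, y_j, r_j = m['summaries'][j]
        votes.append(y_j)                                        # f_i(x) in {-1, 1}^q
        r_vecs.append(r_j)
        dts.append(t_now - m['time'])
        dists.append(np.linalg.norm(x - o_j))
    alpha = np.array(r_vecs) / np.array(r_vecs).sum(axis=0)      # Eq. (7), per label
    beta = np.exp(-np.array(dts))
    beta = beta / beta.sum()                                     # Eq. (8)
    gamma = np.exp(-np.array(dists))
    gamma = gamma / gamma.sum()                                  # Eq. (9)
    w = alpha * beta[:, None] * gamma[:, None]                   # per-model, per-label weights
    g = (np.array(votes) * w).sum(axis=0)                        # Eq. (10)
    return np.where(g > 0, 1, -1), w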

4.4 The Classification Algorithm

The pseudo code of our method for classifying multi-label data streams using the ML-KNN-based ensemble method is given in Algorithm 1.

Algorithm 1. SWMEC: classifying a multi-label data stream with the weighted ML-KNN-based ensemble.
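Since the pseudo-code figure itself is not reproduced here, the following is a sketch of the overall SWMEC loop as we read it from Sects. 4.2 and 4.3. It reuses train_subclassifier and ensemble_predict above; the chunk iterator, the smoothed weight update, and the use of predicted labels when retraining on an unlabeled chunk are our own simplifying assumptions:

import time
import numpy as np

def swmec_stream(chunks, L=4, h=10, k=4, eps=0.001):
    models, predictions = [], []
    for X_chunk, Y_chunk in chunks:              # Y_chunk is None for unlabeled chunks
        if len(models) < L and Y_chunk is not None:
            models.append(train_subclassifier(X_chunk, Y_chunk, h, k))
            continue
        if not models:                           # nothing to classify with yet
            continue
        preds = []
        t_now = time.time()
        for x in X_chunk:
            y_hat, w = ensemble_predict(x, models, t_now)
            preds.append(y_hat)
            for i, m in enumerate(models):       # fold each instance's weight into the model weight
                m['weight'] = 0.5 * m['weight'] + 0.5 * float(w[i].mean())
        predictions.append(np.array(preds))
        weights = [m['weight'] for m in models]
        if min(weights) < eps:                   # replace the weakest model with one trained
            idx = int(np.argmin(weights))        # on the newly classified chunk (Sect. 4.2)
            Y_hat = (np.array(preds) > 0).astype(int)
            models[idx] = train_subclassifier(X_chunk, Y_hat, h, k)
    return predictions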

5 Experiments

In this section, we report experiments evaluating the effectiveness of the proposed SWMEC for classification on the real-world multi-label data stream TMC2007 [24] (see Table 2 for details). All experiments are conducted on a PC with an Intel(R) Core(TM) i3-3220 3.30 GHz CPU and 4 GB RAM. We implemented SWMEC in Python 2.7. To obtain more precise classification results, we first preprocess the data set by removing infrequent words that occur in less than 11% of the documents, as done in [14].

Table 2. Summary of dataset TMC2007

A. Classification quality comparison with ML-KNN

We compare the performance of our approach SWMEC against ML-KNN [31], which serves as the baseline of our model. To apply this state-of-the-art offline method to stream classification, we train an ML-KNN classifier on the latest chunk of data and use it to classify the next chunk, similar to a sliding-window approach. Since the data chunk size is the most important factor in the classification and training process, we vary the chunk size (sliding window size) to test the performance of basic ML-KNN. The parameters are set as follows. SWMEC: L (the size of the ensemble) \(=4\), k (the number of nearest neighbors in ML-KNN) \(=4\), \(\epsilon \) (the weight-adjustment threshold) \(=0.001\). ML-KNN (\(w=100\)): w (the window size) \(=100\). ML-KNN (\(w=200\)): w (the window size) \(=200\). ML-KNN (\(w=400\)): w (the window size) \(=400\).

Multi-label classification has many different evaluation metrics, such as Hamming loss, F-measure, Log loss and Ranking loss [17]. Here we adopt two metrics to evaluate multi-label classification performance in a data stream. The first is Micro F1, which considers the micro-averages of precision and recall with equal importance and evaluates a classifier's label-set prediction performance. The higher the value, the better the performance.

$$ MicroF1\left( f,D_{L}\right) =\frac{2\times \sum _{i=1}^{|D_{L}|}|f\left( x_{i}\right) \cap Y_{i}|}{\sum _{i=1}^{|D_{L}|}|f\left( x_{i}\right) |+\sum _{i=1}^{|D_{L}|}|Y_{i}|} $$

where |D| is the number of instances in a multi-label data stream D containing \((x_{i},Y_{i})\) with \(Y_{i}\subseteq L\ (i=1,\cdots ,|D|)\), \(f(x_{i})\subseteq L\) denotes a multi-label classifier's predicted label set for \(x_{i}\), and \(f(x_{i},k)\) denotes the classifier's probability output for \(x_{i}\) on the k-th label (\(l_{k}\)).
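For reference, this metric can be computed over the ±1 label vectors of Sect. 3 as follows (a minimal sketch; the function name is our own):

import numpy as np

def micro_f1(Y_pred, Y_true):
    # Y_pred, Y_true: (n, q) matrices with entries in {-1, 1}
    pred, true = (Y_pred == 1), (Y_true == 1)
    inter = np.logical_and(pred, true).sum()
    return 2.0 * inter / (pred.sum() + true.sum())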

The second is Ranking loss [7], which computes the average number of label pairs that are incorrectly ordered for \(Y_{i}\), weighted by the size of the label set and the number of labels not in the label set. It evaluates the quality of the classifier's probability or real-valued outputs \(f(x_{i},k)\). The best performance is achieved with a ranking loss of zero.

$$ RankLoss(f,D)=\frac{1}{|D|}\sum _{i=1}^{|D|}\frac{1}{|Y_{i}||\bar{Y_{i}}|}loss\left( f,x_{i},Y_{i}\right) $$
$$ loss(f,x_{i},Y_{i})=\sum _{k\in Y_{i}}\sum _{k^{'}\in \bar{Y_{i}}}I\left( f\left( x_{i},k\right) \le f\left( x_{i},k^{'}\right) \right) $$

where \(\bar{Y_{i}}\) denotes the complementary set of \(Y_{i}\) in \(\mathcal {L}\).
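A corresponding sketch for Ranking loss, assuming a matrix of real-valued scores \(f(x_{i},k)\) (the function name is illustrative):

import numpy as np

def ranking_loss(F, Y_true):
    # F: (n, q) real-valued scores f(x_i, k); Y_true: (n, q) matrix in {-1, 1}
    n, q = F.shape
    total = 0.0
    for i in range(n):
        pos = np.where(Y_true[i] == 1)[0]
        neg = np.where(Y_true[i] == -1)[0]
        if len(pos) == 0 or len(neg) == 0:
            continue                             # skip degenerate label sets
        wrong = sum(F[i, a] <= F[i, b] for a in pos for b in neg)
        total += wrong / float(len(pos) * len(neg))
    return total / n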

Fig. 1. SWMEC against ML-KNN on the TMC2007 dataset.

Figure 1(a) shows the Micro F1 of the four algorithms throughout the stream on the TMC2007 dataset. We report the average performance over every |D| / 10 instances. For example, at X axis \(=5\), the Y values show the average Micro F1 of the four classification models from the \(|D|*4/10\)-th instance of the stream to the \(|D|*5/10\)-th instance. At this point, the Micro F1 of SWMEC is 0.3729, and the Y values of ML-KNN (\(w=100\)), ML-KNN (\(w=200\)) and ML-KNN (\(w=400\)) are all below that of SWMEC.

We also calculate the Ranking loss of the four classification models. Figure 1(b) shows the Ranking loss of the four algorithms throughout the stream on the TMC2007 dataset. For example, at X axis \(=6\), the Ranking loss of SWMEC is 0.1571, and the Y values of ML-KNN (\(w=100\)), ML-KNN (\(w=200\)) and ML-KNN (\(w=400\)) are all higher than that of SWMEC.

B. Classification quality comparison with SMART [14]

We also compare the performance of our approach SWMEC against SMART [14], which also adopts an ensemble strategy and provided significant inspiration for this paper. As Fig. 2(a) and (b) show, our method SWMEC is slightly better than SMART in Micro F1 and Ranking loss. Especially noteworthy is that SMART has a much larger space overhead than SWMEC, because SMART needs to maintain several tree structures while SWMEC only stores a small number of center points. In the future, we would like to combine the advantages of these two methods to address the multi-label classification problem in data streams.

Fig. 2. SWMEC against SMART on the TMC2007 dataset.

6 Conclusion

This paper presents an efficient algorithm for multi-label data stream classification based on ML-KNN. Because of the properties of data streams and the multiple labels assigned to each instance, this problem is more challenging than traditional static multi-label classification and single-label data stream classification. To address these challenges, we propose an ensemble multi-label data stream classification approach, namely the Streaming Weighted ML-KNN-based Ensemble Classifier (SWMEC), which efficiently updates the model as the multi-label data stream flows. The model can then effectively and efficiently predict multiple labels for future data points. The experimental results on a real-world dataset validate that our approach is effective and efficient for multi-label stream classification.