1 Introduction

Online machine learning, characterized by its real-time learning and adaptive capabilities [14], is emerging as a critical component in a wide range of applications, including river flow prediction and water quality assessment. Developing robust and efficient online machine learning algorithms is therefore important, given their ability to handle continuous data streams and adapt to evolving data patterns.

Although online multi-output regression has received limited attention, the need to predict multiple outputs rather than a single one is becoming prevalent in a growing number of applications. Multi-output regression poses a greater challenge than its single-output counterpart because it must handle potential structural relationships among the outputs. Nevertheless, online multi-output regression holds significant potential in a variety of applications, e.g., train carriage load prediction [22] and water quality prediction [6].

Typically, multi-output regression methods can be categorized into global and local methods. Global methods predict all output variables with a single model, while local methods combine multiple sub-models, typically one per output. However, existing multi-output regression algorithms are designed to learn a mapping from the input space to the output space over the entire training dataset, making them suitable for batch processing environments but difficult to apply where online model updates are required.

Several online learning algorithms have been proposed [1, 3, 5, 8, 11, 15, 23], with the Incremental Structured Output Prediction Tree (iSOUP-Tree) [15] and Passive-Aggressive (PA) [3] algorithms being notable examples. iSOUP-Tree is a tree-based method that predicts all targets simultaneously in multi-target regression tasks on data streams. PA, in contrast, is suited to single-output tasks: it adapts quickly to data changes by updating its model parameters analytically, balancing proximity to the current model against achieving a unit margin on the latest example. However, both algorithms face challenges in effectively capturing and exploiting output correlations in multi-output regression tasks.

In this paper, we aim to address this problem by introducing a novel online multi-output regression algorithm called “Correlated Online k-Nearest Neighbors Regressor Chain” (CONNRC). Our proposed algorithm utilizes a maximum correlation chain structure to capture the associations among output variables while also leveraging the strengths of the k-nearest neighbors (kNN) algorithm.

The main contributions of this paper are as follows:

  • We propose a novel online multi-output regression algorithm, CONNRC, which extends the kNN algorithm to handle online multi-output regression tasks. This algorithm leverages a maximum correlation chain structure to capture the association between output variables, thus addressing the challenge of handling potential structural relationships between the outputs in multi-output regression tasks.

  • We comprehensively evaluate the proposed CONNRC algorithm on six real-world datasets. The evaluation includes a comparison with the existing online multi-output regression algorithm, iSOUP-Tree [15], and several classic online learning algorithms, including Hoeffding Adaptive Tree (HAT) [1], Adaptive Model Rules (AMRules) [5], Stochastic Gradient Trees (SGT) [8], and PA [3]. Our results demonstrate that CONNRC consistently outperforms other state-of-the-art algorithms in terms of average Mean Absolute Error (aMAE), signifying its superior accuracy in online multi-output regression tasks.

2 Related Work

The related work for this study can be broadly categorized into two main areas: online multi-output regression and classic online learning methods.

2.1 Online Multi-output Regression

Online multi-output regression is a crucial technique for modeling, predicting, and compressing multi-dimensional correlated data streams. The MORES method [11] dynamically learns the structure of the regression coefficients and residual errors to improve prediction accuracy. It introduces three modified covariance matrices to extract the necessary information from all previously seen data and assigns different weights to samples to track the evolving characteristics of the stream. The iSOUP-Tree method [15] learns trees that predict all targets simultaneously. The iSOUP-optionTree extends iSOUP-Tree with option nodes and can serve as a base learner in ensemble approaches such as online bagging and online random forest. MORStreaming [23] is another notable approach. It makes predictions with an instance-based model and consists of two algorithms: an online algorithm based on topology networks to learn the instances and an online algorithm based on adaptive rules to learn the correlations between outputs automatically.

2.2 Classic Online Regression

Classic online learning methods have been extensively studied and applied in various domains; they typically focus on single-output regression tasks. PA [3] is a well-known margin-based online learning algorithm for regression with an analytical solution for updating the model parameters as new samples arrive. HAT [1] is a tree-based method for online regression. It places an ADWIN concept-drift detector at each decision node to monitor possible changes in the data distribution. If a drift is detected at a node, an alternate tree is induced in the background; once enough information has been gathered, the node where the change was detected is replaced by its alternate tree. SGT [8, 12] is a tree-based regression method that directly minimizes a loss function to guide tree growth and update its predictions, which differentiates it from other incremental tree learners that optimize a data-impurity heuristic rather than the loss itself. AMRules [5] is an online regression algorithm that constructs rules with attribute conditions as antecedents and linear combinations of attributes as consequents. It employs a Page-Hinkley test to identify and adapt to changes in the data stream and includes outlier detection to prevent the model from being skewed by anomalous examples.

3 Proposed Method

The Correlated Online k-Nearest Neighbors Regressor Chain (CONNRC) algorithm, inspired by the Regressor Chain (RC) [21] and its variants [13, 17, 19, 20], is designed for online multi-output regression tasks. It uses a chain structure to leverage correlations among multiple output variables, improving prediction performance. While the kNN algorithm forms its basis, CONNRC enhances it by incorporating correlation information, addressing a limitation of traditional online kNN in online multi-output regression tasks.

Throughout this paper, we use the following notation. A vector of output variables is written \({\boldsymbol{y}} = \left( y_1, \ldots , y_t, \ldots , y_N \right) \), where \(y_t\) is the t-th sample, \(t=1,\ldots , N\), and N is the total number of samples. \(\textbf{Y}=\left( {\boldsymbol{y}}_1, \ldots , {\boldsymbol{y}}_m\right) \) denotes the matrix of output variables, where \({\boldsymbol{y}}_i\) is the i-th output vector and m is the total number of output variables. \(\textbf{X}\) is the input feature matrix with N samples and d feature variables. The correlation coefficient between \({\boldsymbol{y}}_i\) and \({\boldsymbol{y}}_j\) is denoted \(c_{i,j} = {\text {corr}}({\boldsymbol{y}}_i,{\boldsymbol{y}}_j)\), and the full correlation matrix is \(\textbf{COE}_{m \times m}={\text {corr}}(\textbf{Y})\). Summing the i-th row of \(\textbf{COE}\) yields the cumulative correlation value \(c_i\), which measures the degree of association between \({\boldsymbol{y}}_i\) and all other variables in \(\textbf{Y}\). We compute correlations with the Spearman correlation coefficient, which makes no assumption about the distribution of the data and is insensitive to the absolute magnitude of the values. Finally, \({\boldsymbol{I}} = (I_1, \ldots , I_m)\) is the index vector obtained by ranking the cumulative correlation values \(c_i\) in descending order; it determines the order of the meta-models in the chain. This order guides the sequence in which the kNN models are updated, based on the assumption that output variables with higher cumulative correlation are more informative and should be predicted first.
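To make the chain construction concrete, the following minimal sketch (the function and variable names are our own, not the paper's implementation) computes \(\textbf{COE}\), the cumulative correlation values \(c_i\), and the index vector \({\boldsymbol{I}}\) from an output matrix:

```python
import numpy as np
import pandas as pd

def chain_order(Y: np.ndarray) -> np.ndarray:
    """Chain order I: output indices ranked by descending cumulative correlation.

    Y has shape (N, m): N samples of m output variables.
    """
    # COE: the m x m Spearman correlation matrix of the outputs.
    coe = pd.DataFrame(Y).corr(method="spearman").to_numpy()
    # Cumulative correlation c_i: the sum of row i of COE, measuring how
    # strongly y_i is associated with all other output variables.
    c = coe.sum(axis=1)
    # The index vector I is the descending rank of the c_i values.
    return np.argsort(-c)
```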

A kNN model is initialized for each output variable, and a window of recent data points is maintained. The maximum window size, w, is a parameter that can be adjusted to the requirements of the task. A simple mechanism keeps the window populated with relevant data points: the current point is added to the window if its distance to its nearest neighbor exceeds a threshold, \(min\_distance\_keep\), and the oldest point is removed when the window is full.
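A minimal sketch of such a window is shown below; the parameter names follow the text, while the Euclidean distance metric and the data layout are our assumptions:

```python
import numpy as np
from collections import deque

class SlidingWindow:
    """Bounded window of (x, y) pairs serving as the kNN reference set."""

    def __init__(self, w: int, min_distance_keep: float):
        self.w = w                              # maximum window size
        self.min_distance_keep = min_distance_keep
        self.points = deque()                   # stored (x, y) pairs

    def nearest_distance(self, x: np.ndarray) -> float:
        """Euclidean distance from x to its nearest stored neighbor."""
        if not self.points:
            return np.inf
        return min(np.linalg.norm(x - px) for px, _ in self.points)

    def maybe_add(self, x: np.ndarray, y: float) -> None:
        """Keep (x, y) only if it is far enough from the stored points;
        evict the oldest point when the window is full."""
        if self.nearest_distance(x) > self.min_distance_keep:
            if len(self.points) >= self.w:
                self.points.popleft()           # drop the oldest point
            self.points.append((x, y))
```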

For each training example in the stream, \(({\boldsymbol{x}},{\boldsymbol{y}})\), the kNN models are updated in chain order. The first model, kNN\(_1\), is updated with the input features and the corresponding output value, \(({\boldsymbol{x}},y_1)\). Each subsequent model, kNN\(_i\), is updated with the input features augmented by the predicted values of the preceding output variables, paired with its own output value, \(\left(({\boldsymbol{x}}, \hat{y}_1, \ldots , \hat{y}_{i-1}),\, y_i\right)\).

After the model update, the \(k_{neighbors}\) nearest neighbors of the current point \(({\boldsymbol{x}},y_i)\) are retrieved from the window. If the distance to the nearest neighbor exceeds the threshold, the current point is added to the window, and the oldest point is removed when the window is full, which keeps the window focused on the most relevant points for prediction. The pseudocode of CONNRC is presented in Algorithm 1.

Algorithm 1. Pseudocode of the CONNRC algorithm.

In this algorithm, the historical dataset is denoted as \(D = (\textbf{X}_{history}, {\boldsymbol{y}}_{history})\). For the current experiment, we utilize the first 25% of the entire dataset as our training data.
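Putting these pieces together, one online step of the chain can be sketched as follows, under our reading of Algorithm 1 (`SlidingWindow` and `chain_order` come from the sketches above; `knn_predict` is a hypothetical helper that averages the targets of the \(k_{neighbors}\) nearest stored points):

```python
import numpy as np

def knn_predict(window, x, k_neighbors):
    """Hypothetical helper: mean target of the k nearest points in a window."""
    if not window.points:
        return 0.0  # cold start before any point has been stored
    ranked = sorted(window.points, key=lambda p: np.linalg.norm(x - p[0]))
    return float(np.mean([y for _, y in ranked[:k_neighbors]]))

def connrc_step(windows, order, x, y, k_neighbors):
    """One prequential step of CONNRC for a sample (x, y).

    windows: one SlidingWindow per output variable.
    order:   chain order I, most correlated output first.
    """
    z = np.asarray(x, dtype=float)
    preds = {}
    for i in order:
        preds[i] = knn_predict(windows[i], z, k_neighbors)  # evaluate first
        windows[i].maybe_add(z, y[i])                       # then learn
        # Chain step: augment the input with the prediction for y_i so
        # the next meta-model can exploit output correlations.
        z = np.append(z, preds[i])
    return preds
```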

The CONNRC algorithm has several advantages. First, it accounts for the correlations among multiple output variables through its correlation chain structure, which can improve prediction performance in multi-output regression tasks. Second, it builds on the simple yet effective kNN algorithm, making it easy to implement and understand. CONNRC has two hyperparameters that need to be tuned: the maximum window size (w) and the threshold for adding points to the window (\(min\_distance\_keep\)). Furthermore, computing the \(k_{neighbors}\) nearest neighbors can be time-consuming, especially when the window size is large.

4 Experiment Framework

In this section, we present the experimental framework used to evaluate the performance of our proposed method, CONNRC, in comparison with several state-of-the-art online regression algorithms. All experiments were implemented in Python 3.9 and executed on a PC with an Intel Core i7-12700 processor (4.90 GHz) and 16 GB of RAM.

4.1 Datasets

Our proposed method, CONNRC, underwent evaluation on six real-world datasets, each exhibiting unique characteristics and posing specific challenges. These datasets were chosen based on their previous utilization in online multi-output regression research [4, 15] and their role as benchmark datasets in the field of multi-output regression [21]. The datasets used in our evaluation are as follows:

  • Water Quality Prediction [6]: It comprises 16 input attributes that are associated with physical and chemical water quality parameters. Additionally, it includes 14 target attributes that represent the relative presence of plant and animal species in Slovenian rivers.

  • Supply Chain Management Prediction [9, 16]: Derived from the 2010 Trading Agent Competition in Supply Chain Management (TAC SCM) tournament, this benchmark comprises two sub-datasets, each with 16 regression targets. In SCM1D [9], each target is the mean price of a product in the simulation for the next day; in SCM20D [16], it is the mean price over a 20-day period in the future.

  • River Flow Prediction [21]: This dataset consists of hourly flow observations collected from eight sites within the Mississippi River network in the United States, obtained from the US National Weather Service. Each row contains the most recent observation for each of the eight sites together with time-lagged observations from 6, 12, 18, 24, 36, 48, and 60 h in the past.

  • Energy Building Prediction [18]: The focus of this dataset is the prediction of heating load and cooling load requirements for buildings. The prediction is based on eight building parameters, including glazing area, roof area, overall height, and other relevant factors.

  • Sea Water Quality Prediction [10]: The Andromeda dataset focuses on predicting future values for six water quality variables, namely temperature, pH, conductivity, salinity, oxygen, and turbidity. The dataset pertains specifically to the Thermaikos Gulf of Thessaloniki in Greece. The predictions are made using a window of five days and a lead time of five days, indicating that the model aims to forecast the values of these variables five days ahead based on the observations from the previous five days.

These datasets represent a variety of application domains and provide a comprehensive basis for evaluating the performance of the proposed method.

4.2 Experiment Settings

To evaluate the performance of each multi-output method, we employed the prequential strategy introduced in [7]: each sample is first used to evaluate the model and only afterwards to update it. This approach simulates a real-time learning environment, a key characteristic of online machine learning. We also employed progressive validation, as suggested in [2], which is considered the canonical way to evaluate a model's performance because it assesses how the model would have performed in a production scenario. In progressive validation, the dataset is transformed into a sequence of queries and responses: samples are processed one after the other, and at each step the model either predicts an observation or undergoes an update.
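Concretely, a prequential (test-then-train) loop looks like the sketch below. The `predict_one`/`learn_one` interface follows the River online-learning library, in which all of our baseline algorithms are available (an assumption on our part; the paper does not name its implementation), and the model, metric, and synthetic stream here are placeholders:

```python
from river import linear_model, metrics

model = linear_model.PARegressor()  # placeholder single-output regressor
mae = metrics.MAE()

# Placeholder stream of (feature-dict, target) pairs.
stream = [({"x1": 0.1 * i, "x2": 0.2 * i}, 0.3 * i) for i in range(100)]

for x, y in stream:
    y_pred = model.predict_one(x)  # 1) evaluate on the not-yet-seen sample
    mae.update(y, y_pred)          # 2) update the running error estimate
    model.learn_one(x, y)          # 3) only then train on the sample
```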

In our comparison, we assessed the performance of our proposed algorithm, CONNRC, against the state-of-the-art iSOUP-Tree as well as several classic online learning algorithms: PA, AMRules, HAT, and SGT. It is important to note that our task involves multi-output regression, whereas these classic online learning algorithms were originally designed for single-output tasks. To adapt them to the multi-output setting, we trained a separate model for each output variable, denoting the adapted versions with the prefix "MT-". We chose single-output classic online learning algorithms as baselines because online multi-output regression currently receives little attention and few open-source algorithms are available for this purpose. These baselines represent a diverse range of approaches to online regression and provide a comprehensive benchmark for evaluating CONNRC.
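The "MT-" adaptation can be realized with a simple per-target clone of the base model; the sketch below is our own illustration of this scheme, not the paper's code:

```python
import copy

class PerTargetRegressor:
    """Wrap a single-output online regressor: one independent clone per output."""

    def __init__(self, base_model, n_outputs: int):
        self.models = [copy.deepcopy(base_model) for _ in range(n_outputs)]

    def predict_one(self, x):
        return [m.predict_one(x) for m in self.models]

    def learn_one(self, x, y):
        # y is a sequence holding one target value per output variable.
        for m, y_i in zip(self.models, y):
            m.learn_one(x, y_i)
```

For instance, the MT-PA baseline would correspond to `PerTargetRegressor(linear_model.PARegressor(), n_outputs=m)` for a task with m output variables.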

4.3 Performance Evaluation

To evaluate the overall performance of each model, we conducted performance evaluations at regular intervals, incrementally increasing the data sample by 25%. This allowed us to monitor performance as the dataset size grew and to gain a comprehensive understanding of effectiveness across sample sizes. We considered three metrics: average Mean Absolute Error (aMAE), memory usage, and running time; for all three, smaller values indicate better performance. The aMAE is calculated by the following equation:

$$\begin{aligned} \textrm{aMAE}=\frac{1}{m} \sum _{t=1}^{m} \frac{1}{N} \sum _{i=1}^{N}\left| y_i^t-\widehat{y}_i^t\right| . \end{aligned}$$
(1)
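Since every target is evaluated over the same N samples, Eq. (1) reduces to the mean absolute deviation over all N × m entries; a minimal helper (our own, not the paper's implementation):

```python
import numpy as np

def amae(Y_true: np.ndarray, Y_pred: np.ndarray) -> float:
    """aMAE of Eq. (1): the mean over the m targets of each target's MAE.

    Both arrays have shape (N, m).
    """
    return float(np.mean(np.abs(Y_true - Y_pred)))
```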

5 Experiment Results and Discussion

Figure 1 illustrates the accuracy performance of the algorithms based on the average Mean Absolute Error (aMAE). Our proposed method, CONNRC, consistently outperforms the other algorithms by achieving the lowest aMAE values across all sample sizes and datasets. The performance of CONNRC remains stable and promising at every regular checkpoint, whereas the performance of the other algorithms fluctuates. This underscores the effectiveness of CONNRC in capturing and leveraging the correlations among multiple outputs, resulting in enhanced prediction accuracy.

Fig. 1. Comparison of aMAE by different algorithms at regular intervals of 25% data sample increments

While CONNRC demonstrates superior accuracy compared to other methods, it also exhibits the highest time consumption across all datasets, as depicted in Fig. 2. This can be attributed to the time complexity associated with searching for the nearest neighbors in the kNN algorithm, indicating a potential area for future improvement. Nevertheless, it is important to note that CONNRC still maintains reasonable efficiency in certain cases, suggesting a balance between accuracy and computational speed.

Fig. 2. Comparison of computational time by different algorithms at regular intervals of 25% data sample increments

It is worth noting that although MT-PARegressor demonstrates the poorest accuracy in 4 out of 6 datasets, it stands out as the fastest algorithm. This highlights the potential trade-off between accuracy and computational efficiency when selecting an algorithm for specific applications.

Figure 3 illustrates the memory usage comparison among the algorithms. It is evident that MT-PARegressor exhibits the highest efficiency in terms of memory consumption across all datasets. In contrast, in most cases, CONNRC and MT-SGTRegressor tend to consume more memory. However, it is worth noting that CONNRC demonstrates reasonable efficiency, particularly in memory usage, on certain datasets, such as the River Flow dataset.

Fig. 3. Comparison of memory usage by different algorithms at regular intervals of 25% data sample increments

One key difference between the online kNN algorithm and other online learning algorithms lies in how they handle data. Although both learn from each sample only once, kNN retains the data, which allows its performance to approach that of batch algorithms that can revisit each sample multiple times. Typical online learning algorithms, which do not store data, often do not achieve the same level of robustness and accuracy as batch algorithms. CONNRC inherits this advantage of data retention from kNN. In addition, CONNRC incorporates a correlation chain structure to capture and exploit the correlations among structured multi-outputs, and each meta-model in the chain uses the output of the preceding meta-model as latent information. Similar to knowledge distillation, this enhances CONNRC's robustness against concept drift, allowing it to adapt to changing data patterns.

6 Conclusion and Future Work

This paper introduced and evaluated CONNRC, an online multi-output regression algorithm, on six different datasets. Our experimental results demonstrated that CONNRC consistently outperformed the other algorithms in accuracy, as measured by the aMAE metric, indicating that it provides more accurate predictions in online multi-output regression tasks than the evaluated alternatives. However, the time performance of CONNRC was relatively slow compared to the other algorithms, which could be a limitation in scenarios where real-time predictions are crucial and time efficiency is a primary concern. In terms of memory, CONNRC demonstrated reasonable efficiency and was not the most memory-consuming of the tested algorithms.

Despite these limitations, the superior accuracy performance of CONNRC underscores its potential for online multi-output regression tasks. The algorithm’s ability to adapt to changes in the data stream and make accurate predictions is a significant advantage in many real-world applications, such as environmental monitoring, energy efficiency prediction, and water quality prediction.

For future work, we have plans to improve the efficiency of CONNRC, particularly its time performance. One potential approach is to enhance the nearest neighbor search in the kNN component of the algorithm, as it currently constitutes the most time-consuming part. Through optimization of this process, we aim to significantly reduce the prediction time, making CONNRC more suitable for real-time applications. Given the potential application scenarios involving massive streaming data, we will seek large datasets to validate the performance of our algorithms in such environments.