5.1 Data Selection Methods

The choice of the data selection method and the calculation of performance are where BML and OML differ most. Among other things, the resources in OML (memory and time, but not the data) are severely limited. In addition, cross-validation (CV) is not possible. It is therefore very important to determine which instances are used for training and which for testing (and possibly also for validation).

For each of the selection approaches presented in the following, a metric must be selected, e.g., accuracy or Mean Absolute Error (MAE).

5.1.1 Holdout Selection

In the holdout evaluation method, the performance of the model is evaluated on a test data set consisting of examples that have not yet been seen. These examples are used only for evaluation purposes and not for training the model.

Definition 5.1

(Holdout) In the holdout evaluation method, the performance is evaluated after each batch, i.e., after a certain number of examples or observations. For this purpose, two parameters must be defined:

  1. the size of the (holdout) window and

  2. the frequency of testing.

The holdout evaluation is best when current and representative holdout data are used.
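The following minimal sketch illustrates Definition 5.1 with River. The data stream, the holdout window of size 200, and the test frequency of 500 instances are illustrative choices, not part of spotRiver:

```python
from river import linear_model, metrics
from river.datasets import synth

stream = synth.Friedman(seed=1).take(2_000)       # training stream
holdout = list(synth.Friedman(seed=2).take(200))  # fixed holdout window (size 200)

model = linear_model.LinearRegression()
test_every = 500                                  # frequency of testing

for i, (x, y) in enumerate(stream, start=1):
    model.learn_one(x, y)
    if i % test_every == 0:                       # periodically score on the holdout set
        metric = metrics.MAE()
        for x_h, y_h in holdout:
            metric.update(y_h, model.predict_one(x_h))
        print(f"after {i:5d} instances: {metric}")
```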

Why are holdout data not always used for OML? It is not always easy, or even possible, to obtain such data. In addition, the holdout data set must be representative, which cannot be guaranteed with streaming data due to possible changes: the holdout data of today may already be outdated tomorrow. If the period in which the holdout data are collected is too short, these data may fail to capture essential relationships.

5.1.2 Progressive Validation: Interleaved Test-Then-Train

In statistics, progressive validation generally refers to validation over a longer period of time, e.g., by using control charts. In the streaming-data context, the term is used for approaches in which each individual instance is first used for testing (the model calculates a prediction, from which its quality is determined) and then for learning (training the model). Each individual instance is processed in the order of its arrival. In addition to simple progressive validation, we also consider prequential validation and delayed progressive validation.

5.1.2.1 Progressive Validation

Definition 5.2

(Progressive Validation) Each observation can be denoted as \((X_t, y_t)\), where \(X_t\) is a set of features, \(y_t\) is a label (or a prediction value), and t denotes the time (or simply the index). Before updating the model with the pair \((X_t, y_t)\), the model calculates a prediction \(\hat{y}_t\) for \(X_t\). Using the ground truth \(y_t\) and the predicted value \(\hat{y}_t\), the online metric is then updated. Common metrics such as accuracy, MAE, Mean Squared Error (MSE), and the Area Under the Receiver Operating Characteristic Curve (ROC AUC) are all running sums and can therefore be updated online.

This procedure can also be used for time series: If there are t observations \((x_1, x_2, \ldots , x_t)\), then the values \((x_{t-k}, x_{t-k+1}, \ldots , x_{t-1})\) can be used as \(X_t\) and the value \(x_t\) as \(y_t\). Alternatively, additional features can be calculated from the values \((x_{t-k}, x_{t-k+1}, \ldots , x_{t-1})\) and used as \(X_t\); typical examples are the day of the week or the season.
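As a sketch, the test-then-train loop of Definition 5.2 can be written in a few lines with River; the synthetic Friedman stream is only an illustrative choice:

```python
from river import linear_model, metrics
from river.datasets import synth

stream = synth.Friedman(seed=42).take(1_000)  # synthetic regression stream
model = linear_model.LinearRegression()
metric = metrics.MAE()

for x, y in stream:
    y_pred = model.predict_one(x)  # test first: predict before seeing y
    metric.update(y, y_pred)       # update the online metric with (y, y_hat)
    model.learn_one(x, y)          # then train: update the model with (x, y)

print(metric)                      # running MAE over the entire stream
```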

5.1.2.2 Prequential Validation

Definition 5.3

(Prequential Validation) Prequential validation works like progressive validation (interleaved test-then-train), but new instances are weighted more heavily than old ones. This is implemented by a sliding window or a decay factor.
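A decay factor can be sketched as follows. This is an illustrative implementation, not a River API (River itself provides rolling, i.e., sliding-window, variants of its metrics):

```python
class DecayingMAE:
    """Prequential MAE with a decay factor: recent errors count more."""

    def __init__(self, alpha: float = 0.99):
        self.alpha = alpha            # decay factor; smaller values forget faster
        self._weighted_error = 0.0
        self._weight_total = 0.0

    def update(self, y_true: float, y_pred: float) -> None:
        # Old contributions are shrunk by alpha before the new error is added.
        self._weighted_error = self.alpha * self._weighted_error + abs(y_true - y_pred)
        self._weight_total = self.alpha * self._weight_total + 1.0

    def get(self) -> float:
        return self._weighted_error / self._weight_total if self._weight_total else 0.0
```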

5.1.2.3 Delayed Progressive Validation

Typically, an OML model calculates a prediction \(\hat{y}_t\) and then learns. This was referred to as “progressive validation” in Sect. 5.1.2. The prediction and the observed value can be compared to measure the correctness of the model. This approach is often used to evaluate OML models. In some cases, this approach is not appropriate, because the prediction and the ground truth are not available at the same time. In this case, it makes sense to delay the process until the ground truth is available. This is called delayed progressive validation.

Delayed Progressive Validation

When evaluating a machine learning model, the goal is to simulate production conditions in order to obtain a trustworthy assessment of the model’s performance. For example, consider forecasting the number of bicycles a bike rental service needs for the next week. Once seven days have passed, the actual demand is known and the model can be updated. What we really want is to evaluate the model by forecasting seven days in advance and only updating it once the ground truth is available (Grzenda et al., 2020).

Delayed progressive validation is of great practical importance: instead of updating the model immediately after it has made a prediction, it is only updated when the ground truth is known. In this way, the evaluation more accurately reflects the real process.

5.1.3 Machine Learning in Batch Mode with a Prediction Horizon

The method eval_bml_horizon implements the “classical” BML approach: The classical BML algorithm is trained once on the training data set, resulting in a model, say \(M^{(1)}_{\text {bml}}\), which is not changed: \(M_{\text {bml}}^{(1)} = M_{\text {bml}}\).

The model \(M_{\text {bml}}\) is evaluated on the test data, where the horizon, say \(h \in [1, s_{\text {test}}]\), comes into play: h specifies the size of the partitions into which \(D_{\text {test}}\) is divided. If \(h=s_{\text {test}}\), then the standard “train-test” procedure of Machine Learning (ML) is implemented. If \(h=1\), a pure OML setting is simulated. The OML procedure is only simulated in this case, since the model \(M_{\text {bml}}\) is neither updated nor retrained. The BML approach is shown in Fig. 5.1.

Fig. 5.1

Batch method with a prediction horizon. The training data set \(D_{\text {train}}\) is used once. The model \(M_{\text {bml}}\) trained on \(D_{\text {train}}\) is tested on the individual partitions of the test data set \(D_{\text {test}}\) one after the other. The lower figure shows (as a special case) the data sets when a classical holdout approach is used. In this case, the size of the test data set is equal to the size of the horizon

If the entire test data set is used for the prediction horizon in the batch method, i.e., \(s_{\text {test}} = h\), then we obtain the classical holdout approach (see Sect. 5.1.1).

5.1.4 Landmark Batch Machine Learning with a Prediction Horizon

The method eval_bml_landmark implements a landmark approach. The first step is similar to the first step of the BML approach, so that \(M_{\text {bml}}^{(1)}\) is available. The following steps differ: After a prediction with \(M_{\text {bml}}^{(1)}\) has been calculated for the batch of data instances from the interval \([s_{\text {train}}, s_{\text {train}} + h]\), the algorithm is retrained on the interval \([1, s_{\text {train}} + h]\) and an updated model \(M_{\text {bml}}^{(2)}\) is available. In the third step of the landmark BML, \(M_{\text {bml}}^{(2)}\) calculates predictions for \([s_{\text {train}} + h, s_{\text {train}} + 2h]\) and a new model \(M_{\text {bml}}^{(3)}\) is trained on \([1, s_{\text {train}} + 2h]\). The landmark approach is shown in Fig. 5.2.

Fig. 5.2

Landmark batch method with a prediction horizon

5.1.5 Window-Batch Method with Prediction Horizon

The method eval_bml_window implements a window approach. Here, too, the first step is similar to the first step of the BML approach and \(M_{\text {bml}}^{(1)}\) is available. The following steps are similar to the landmark approach, with one important exception: The algorithm is not trained on the complete set of seen data. Instead, it is trained on a sliding window of size \(s_{\text {train}}\). The window batch approach is shown in Fig. 5.3.
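The interval logic of the three batch splitting schemes (Sects. 5.1.3 to 5.1.5) can be summarized in a small sketch. This is illustrative code, not the spotRiver implementation:

```python
def partitions(s_train: int, s_test: int, h: int, mode: str = "horizon"):
    """Yield (train, test) index ranges for a stream of s_train + s_test instances."""
    n = s_train + s_test
    for start in range(s_train, n, h):
        test = range(start, min(start + h, n))
        if mode == "horizon":      # train once on D_train, never retrain
            train = range(0, s_train)
        elif mode == "landmark":   # retrain on all data seen so far
            train = range(0, start)
        elif mode == "window":     # retrain on a sliding window of fixed size s_train
            train = range(start - s_train, start)
        else:
            raise ValueError(f"unknown mode: {mode}")
        yield train, test

# Example: the window scheme with s_train = 100, s_test = 30, and horizon h = 10.
for train, test in partitions(100, 30, 10, mode="window"):
    print(train, test)
```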

Fig. 5.3

Window-batch method with a prediction horizon. This division of the training and test data set ensures that the size of the training data set \(s_{\text {train}}\) remains unchanged and that a prediction horizon h of the same size is always used

5.1.6 Online Machine Learning with a Prediction Horizon

The method eval_oml_horizon implements an OML approach. This approach differs fundamentally from the batch approaches of ML, since each individual instance is used for prediction and training. If \(h=1\), a “pure” OML algorithm is implemented. If \(h>1\), the OML calculations are performed h times.

5.1.7 Online Machine Learning

The method eval_oml_iter_progressive is based on the method progressive_val_score from the package River. The iterative procedure is shown in Fig. 5.4.

Table 5.1 provides a comparative overview of the selection methods.

Fig. 5.4

Iterative OML method. If the window size h is one, then an example is used for testing and then for training (updating) the OML algorithm. If \(h >1\), then the calculations are performed h times and the average of these h results is calculated

Table 5.1 Selection methods. The batches are represented by intervals, e.g., \([a, b]\). In the OML approaches, each instance from the interval is passed to the online algorithm separately for prediction and updating (training)

5.2 Determining the Training and Test Data Set in the Package spotRiver

5.2.1 Methods for BML and OML

The BML algorithms require a training data set \(D_{\text {train}}\) of size \(s_{\text {train}}\) to adapt the model. The test data set \(D_{\text {test}}\) of size \(s_{\text {test}}\) is used to evaluate the model on new (unseen) data. For the comparative evaluation of BML and OML algorithms, the package Sequential Parameter Optimization Toolbox for River (spotRiver) provides five different methods.

Table 5.2 Evaluation functions for BML and OML

The first four evaluation functions shown in Table 5.2 accept two data frames as arguments: a training and a test data set. In the pure OML setting, the fifth evaluation function eval_oml_iter_progressive is used. It uses only one (test) data set, since it implements progressive validation. The parameters are shown in Table 5.3.

Table 5.3 Parameters for configuring the methods eval_bml_horizon, eval_bml_landmark, eval_bml_window, and eval_oml_horizon from the package spotRiver. A tuple of two data frames is returned: the first contains the evaluation metrics for each batch of size horizon; the second contains the true and predicted values for each observation in the test data set

Example for the Method eval_oml_horizon

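The code shown in the original screenshot can be reconstructed roughly as follows. The argument names (model, train, test, target_column, horizon, metric) follow Table 5.3, but this is a hedged sketch: the exact signature of eval_oml_horizon may differ between spotRiver versions, and the data frame df is generated here only for illustration.

```python
import numpy as np
import pandas as pd
from river import linear_model
from sklearn.metrics import mean_absolute_error
from spotRiver.evaluation.eval_bml import eval_oml_horizon

# Illustrative data: a linear relationship with noise; "y" is the target column.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1_000, 3)), columns=["x1", "x2", "x3"])
df["y"] = df["x1"] + 2 * df["x2"] + rng.normal(scale=0.1, size=1_000)
train, test = df.iloc[:500], df.iloc[500:]

model = linear_model.LinearRegression()   # OML model from River
df_eval, df_preds = eval_oml_horizon(
    model=model,
    train=train,
    test=test,
    target_column="y",
    horizon=7,
    metric=mean_absolute_error,
)
```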

The method plot_bml_oml_horizon_metrics visualizes (1) the error (e.g., MAE), (2) the memory consumption (MB), and (3) the computation time (s) for different ML models on a given data set. The function takes a list of Pandas data frames as input, each containing the metrics for one model. The parameters of the method plot_bml_oml_horizon_metrics are shown in Table 5.4. Figure 5.5 shows the output of the metrics and Fig. 5.6 shows the residuals, i.e., the differences between the current (actual) and the predicted values.

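The plotting call from the original screenshot can be reconstructed analogously; df_eval is the first data frame returned by eval_oml_horizon above, and the label is illustrative (see Table 5.4 for the documented parameters):

```python
from spotRiver.evaluation.eval_bml import plot_bml_oml_horizon_metrics

# Sketch of the plotting call shown in the screenshot; argument names may
# differ between spotRiver versions.
plot_bml_oml_horizon_metrics(
    df_eval=[df_eval],           # one metrics data frame per model
    df_labels=["OML linear"],    # label used in the legend
    metric=mean_absolute_error,
)
```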
Table 5.4 Parameters for configuring the method plot_bml_oml_horizon_metrics

5.2.2 Methods for OML in River

The methods presented so far (in Sect. 5.2.1) are equally suitable for evaluating BML and OML models for three different data splits (1. horizon, 2. landmark, and 3. window). In this section, the method eval_oml_iter_progressive is presented, which is specifically designed for the evaluation of OML models on a streaming data set. It is based on a method used in the River package, which makes it possible to compare the results with those of River. However, it cannot be used to evaluate BML models.

Fig. 5.5

Results of the method plot_bml_oml_horizon_metrics. Performance (here: MAE, computation time and memory consumption) of an OML linear model

Fig. 5.6

Results of the method plot_bml_oml_horizon_predictions: Representation of the values predicted by the model and the ground truth (“Actual”). It becomes clear how the OML model approaches the ground truth over time and learns the underlying relationship

The method eval_oml_iter_progressive evaluates one or more OML models on a streaming data set. The evaluation is done iteratively, and the models are tested in each “step” of the iteration. The results are returned in the form of a dictionary with metrics and their values. Table 5.5 shows the parameters of the method eval_oml_iter_progressive.

Table 5.5 Parameters for configuring the method eval_oml_iter_progressive from the package spotRiver. A dict (dictionary) with the evaluation results is returned. The keys are the names of the models and the values are dictionaries with the following keys: step (a list of iteration numbers at which the model was evaluated), error (a list of weighted errors for each iteration), r_time (a list of weighted runtimes for each iteration), memory (a list of weighted memory consumptions for each iteration), and metric_name (the name of the metric used for evaluation)

The method plot_oml_iter_progressive visualizes the results based on the dictionary of evaluation results returned by eval_oml_iter_progressive. The visualization follows the corresponding visualization in River. Figure 5.7 shows the output.

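The code in the original screenshot can be sketched as follows. The data set is an illustrative choice and the HATR model matches Fig. 5.7; the exact argument names of eval_oml_iter_progressive are assumptions based on Table 5.5:

```python
from river import datasets, metrics, tree
from spotRiver.evaluation.eval_oml import (
    eval_oml_iter_progressive,
    plot_oml_iter_progressive,
)

dataset = datasets.TrumpApproval()                    # illustrative regression stream
model = tree.HoeffdingAdaptiveTreeRegressor(seed=42)  # HATR, as in Fig. 5.7

res = eval_oml_iter_progressive(
    dataset=dataset,
    model=model,
    metric=metrics.MAE(),
)
plot_oml_iter_progressive(res)
```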
Fig. 5.7

Results of the method plot_oml_iter_progressive. The memory management of the Hoeffding Adaptive Tree Regressor (HATR) model is clearly visible

Notebook: Progressive Validation

An example in the GitHub repository https://github.com/sn-code-inside/online-machine-learning shows how delayed progressive validation can be applied using the moment and delay parameters of the method progressive_val_score. It exploits the fact that each observation in the data stream is shown to the model twice: once to make a prediction, and once to update the model when the true value is revealed.

The moment parameter determines which variable to use as a timestamp, while the delay parameter controls the waiting time before the true values are revealed to the model.
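A sketch adapted from the River documentation illustrates the two parameters; the Bikes data set carries a timestamp feature named moment, and a delay of 30 minutes simulates the waiting time until the ground truth becomes available:

```python
import datetime as dt
from river import compose, datasets, evaluate, linear_model, metrics, preprocessing

dataset = datasets.Bikes()
model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model |= preprocessing.StandardScaler()
model |= linear_model.LinearRegression()

evaluate.progressive_val_score(
    dataset=dataset,
    model=model,
    metric=metrics.MAE(),
    moment="moment",                 # which feature serves as the timestamp
    delay=dt.timedelta(minutes=30),  # ground truth is revealed 30 minutes later
    print_every=60_000,
)
```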

Tip

Further information on progressive validation can be found in the documentation of the River package.

In addition, Grzenda et al. (2020) is worth mentioning, which deals with delayed progressive validation.

5.3 Algorithm (Model) Performance

After the training and test data selection has been performed, the performance of the algorithm (or model) can be estimated. Numerous metrics are available for this purpose. Table 5.6 presents a selection of the metrics available in the package River. The choice of a suitable metric is crucial for the analysis of OML algorithms. For classification tasks, for example, accuracy is a suitable metric only if the classes are balanced. Kappa statistics (see Sect. A.4) are better suited for OML. Thomas and Uminsky (2022) give hints for the selection of suitable metrics.
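River metrics are updated one observation at a time, so they fit directly into the test-then-train loop. A minimal sketch with Cohen’s kappa, which is more robust than accuracy under class imbalance:

```python
from river import metrics

metric = metrics.CohenKappa()
# Pairs of (ground truth, prediction), e.g., collected during test-then-train.
for y_true, y_pred in [(1, 1), (0, 1), (1, 1), (0, 0), (1, 0)]:
    metric.update(y_true, y_pred)
print(metric)   # running kappa statistic
```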

Table 5.6 Metrics in the package River

The computation of the memory consumption is simple only at first glance. Programming languages such as Python or R perform memory management routines autonomously, which cannot be controlled by the user. For example, the garbage collector is not necessarily executed immediately after a call, since the runtime uses its own memory optimization routines and it is sometimes more advantageous from its point of view not to delete the data. There are also many dependencies between individual objects, so they cannot simply be deleted even if this is desirable from the user’s point of view. These remarks apply equally to BML and OML methods. According to our research (exchange with R experts), estimating memory consumption in the programming language R is more difficult than in Python. This was one of the reasons why the studies presented in Chaps. 9 and 10 were carried out with Python. The module tracemalloc, introduced in Python 3.4, was used.
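A minimal sketch of measuring memory consumption with tracemalloc; the list comprehension merely stands in for the training step of an actual model:

```python
import tracemalloc

tracemalloc.start()                       # begin tracing Python allocations
data = [i ** 2 for i in range(100_000)]   # stand-in for model training
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e6:.2f} MB, peak: {peak / 1e6:.2f} MB")
```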

5.4 Data Stream and Drift Generators

Most software packages provide functions for generating synthetic data streams (“data stream generators”). As an example, we list the generators available in the package scikit-multiflow in Sect. 5.4.1. We also describe the SEA synthetic dataset (SEA) and Friedman-Drift generators, which are used in many OML publications that examine drift.

5.4.1 Data Stream Generators in scikit-multiflow

For example, the package scikit-multiflow provides the following data stream generators:

  • Sine generator and anomaly sine generator

  • Mixed data stream generator

  • Random Radial Basis Function stream generator and Random Radial Basis Function stream generator with concept drift

  • Waveform stream generator

  • Regression generator.

Tip

By sorting the observations, concept drift can be simulated (Bifet & Gavaldà, 2009).

5.4.2 SEA-Drift Generator

SEA is a frequently cited data set. Its generator implements a data stream with abrupt drift, as described in Street and Kim (2001). Each observation consists of three features, of which only the first two are relevant. The target variable is binary and positive (true) if the sum of the relevant features exceeds a certain threshold. There are four threshold values to choose from. Concept drift can be introduced at any time during the stream by switching the threshold.

In detail, the SEA data set is generated as follows: First, \(n=60{,}000\) random points are generated in a three-dimensional feature space. The features take values between 0 and 10, with only the first two features (\(f_1\) and \(f_2\)) being relevant. The n points are then divided into four blocks of 15,000 points each. In each block, the class membership of a point is determined by means of a threshold value \(\tau _i\), where i indicates the respective block. The threshold values \(\tau _1= 8\), \(\tau _2=9\), \(\tau _3=7\), and \(\tau _4 = 9.5\) are chosen. In addition, noise is added (“We inserted about 10% class noise into each block of data.”) by swapping 10% of the class memberships. Finally, a test set (\(n_t = 10{,}000\)) is created, consisting of 2,500 data points from each block.

The Python package River provides the function SEA to generate the data. Figure 5.8 shows an instantiation of the SEA drift data.
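A hedged sketch of how a stream like the one in Fig. 5.8 can be produced with River’s SEA generator: switching the variant (i.e., the threshold) after 250,000 instances introduces an abrupt concept drift.

```python
from itertools import chain, islice
from river.datasets import synth

# Two SEA streams with different thresholds, concatenated to create drift.
first = synth.SEA(variant=0, noise=0.1, seed=42)
second = synth.SEA(variant=1, noise=0.1, seed=42)
stream = chain(islice(first, 250_000), islice(second, 250_000))

for x, y in islice(stream, 3):   # peek at the first instances
    print(x, y)
```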

Fig. 5.8

SEA data with drift. Concept changes occur after 250,000 steps

5.4.3 Friedman-Drift Generator

The Friedman-Drift generator introduced in Definition 1.8 is another generator that is frequently cited in the literature (Ikonomovska, 2012). It generates a data stream that simulates the characteristics of streaming data that occur in practice. The generator is implemented in River as FriedmanDrift and is used in Sect. 9.2.
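A short sketch of the River interface; drift_type and position are parameters of FriedmanDrift, where "gra" denotes the global recurring abrupt drift variant and the tuple gives the instances at which the concept changes:

```python
from river.datasets import synth

stream = synth.FriedmanDrift(
    drift_type="gra",          # global recurring abrupt drift
    position=(1_000, 2_000),   # instances at which the concept changes
    seed=42,
)
for x, y in stream.take(2):
    print(x, y)
```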

5.5 Summary

Interleaved test-then-train (or prequential evaluation) is a general method for evaluating learning algorithms in streaming scenarios. It opens up interesting possibilities: the system is able to monitor and diagnose the development of the learning process itself. Delayed progressive evaluation is the subject of current research and enables a realistic analysis of complex changes in online data streams. In addition to quality, however, other criteria/metrics imposed by the properties of data streams must be taken into account. The available memory is one of the most important constraints. Another aspect is time, because algorithms must process the examples as quickly as (if not faster than) they arrive.

Note

The experimental studies in Chap. 9 use the following three properties for the comparison of BML and OML methods:

  1. performance,

  2. memory consumption, and

  3. time consumption.