Abstract
This chapter discusses aspects to be considered when evaluating Online Machine Learning (OML) algorithms, especially when comparing them to Batch Machine Learning (BML) algorithms. The following considerations play an important role:
1. How are training and test data selected?
2. How can performance be measured?
3. What procedures are available for generating benchmark data sets?
Section 5.1 describes the selection of training and test data. Section 5.2 presents an implementation in Python for selecting training and test data. Section 5.3 describes the calculation of performance. Section 5.4 introduces the generation of benchmark data sets in the field of OML.
5.1 Data Selection Methods
The differences between BML and OML are greatest in the choice of the data selection method and in the calculation of the performance. Among other things, resources (memory and time, but not the data) are severely limited in OML. In addition, Cross Validation (CV) is not possible. It is therefore very important to determine which instances are used for training and which for testing (and possibly also for validation).
For each of the selection approaches presented in the following, a metric must be selected, e.g., accuracy or Mean Absolute Error (MAE).
5.1.1 Holdout Selection
In the holdout evaluation method, the performance of the model is evaluated on a test data set consisting of examples that the model has not yet seen. These examples are used only for evaluation and not for training the model.
Definition 5.1
(Holdout) In the holdout evaluation method, the performance is evaluated after each batch, i.e., after a certain number of examples or observations. For this purpose, two parameters must be defined:
1. the size of the (holdout) window and
2. the frequency of testing.
The holdout evaluation is best when current and representative holdout data are used.
Why are holdout data not always used for OML? It is not always easy, or even possible, to obtain such data. In addition, the holdout data set must be representative, which cannot be guaranteed with streaming data due to possible changes: the holdout data of today may already be outdated tomorrow. Conversely, if the period in which the holdout data are collected is too short, these data may fail to capture essential relationships.
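The periodic holdout scheme of Definition 5.1 can be sketched as follows. The function `holdout_eval` and the toy `MajorityClass` model are hypothetical names introduced only for illustration; they are not part of any package. Every `test_every` training instances, the next `window_size` instances are diverted into a holdout window that is used only for testing, never for training:

```python
class MajorityClass:
    """A minimal classifier, used only to make the sketch runnable."""
    def __init__(self):
        self.counts = {}
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

def holdout_eval(stream, model, window_size=10, test_every=100):
    """Periodic holdout: every `test_every` training instances, collect the
    next `window_size` instances as holdout data (testing only, no training)
    and record the accuracy on that window."""
    scores, trained, holdout, collecting = [], 0, [], False
    for x, y in stream:
        if collecting:
            holdout.append((x, y))
            if len(holdout) == window_size:
                correct = sum(model.predict(xh) == yh for xh, yh in holdout)
                scores.append(correct / window_size)
                holdout, collecting = [], False
        else:
            model.learn(x, y)
            trained += 1
            if trained % test_every == 0:
                collecting = True
    return scores

scores = holdout_eval(((None, 1) for _ in range(1000)), MajorityClass())
```

Note the trade-off mentioned above: the holdout window must be large enough to be representative, yet recent enough not to be outdated.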
5.1.2 Progressive Validation: Interleaved Test-Then-Train
In statistics, progressive validation is generally understood to be the validation over a longer period of time, e.g., by using control charts. In the streaming data context, the term is used for approaches in which the individual instances are first used for testing (determining the quality of the model, the model calculates a prediction) and then for learning (training the model). Each individual instance is analyzed according to its arrival order. In addition to simple progressive validation, we also consider prequential validation and delayed progressive validation.
5.1.2.1 Progressive Validation
Definition 5.2
(Progressive Validation) Each observation can be denoted as \((X_t, y_t)\), where \(X_t\) is a set of features, \(y_t\) is a label (or a prediction value), and t denotes the time (or simply the index). Before updating the model with the pair \((X_t, y_t)\), the model calculates a prediction \(\hat{y}_t\) for \(X_t\). Using the ground truth \(y_t\) and the predicted value \(\hat{y}_t\), the online metric is then updated. Common metrics such as accuracy, MAE, Mean Squared Error (MSE), and the Area Under the Receiver Operating Characteristic Curve (ROC AUC) can all be expressed as running sums and can therefore be updated online.
This procedure can also be used for time series: If there are t observations \((x_1, x_2, \ldots , x_t)\), then the values \((x_{t-k}, x_{t-k+1}, \ldots , x_{t-1})\) can be used as \(X_t\) and the value \(x_t\) as \(y_t\). Alternatively, additional features can be calculated from the values \((x_{t-k}, x_{t-k+1}, \ldots , x_{t-1})\), which are then used as \(X_t\). Typical features are the information about the day of the week or the season.
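The test-then-train loop of Definition 5.2 can be sketched in a few lines. The names `progressive_validation` and `RunningMean` are hypothetical, introduced only for illustration; the running MAE stands in for any online-updatable metric:

```python
class RunningMean:
    """Toy regression model: always predicts the mean of the targets seen so far."""
    def __init__(self):
        self.n, self.s = 0, 0.0
    def predict(self, x):
        return self.s / self.n if self.n else 0.0
    def learn(self, x, y):
        self.n += 1
        self.s += y

def progressive_validation(stream, model):
    """Interleaved test-then-train: for every (X_t, y_t), first predict
    y_hat_t, update the online metric (here: MAE), then learn from y_t."""
    n, total = 0, 0.0
    for x_t, y_t in stream:
        y_hat = model.predict(x_t)   # test first ...
        total += abs(y_t - y_hat)    # ... update the running MAE ...
        n += 1
        model.learn(x_t, y_t)        # ... then train
    return total / max(n, 1)

mae = progressive_validation([(None, 2.0), (None, 2.0), (None, 2.0)], RunningMean())
```

Because each instance is predicted before the model has seen its label, the resulting metric is an honest estimate of out-of-sample performance.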
5.1.2.2 Prequential Validation
Definition 5.3
(Prequential Validation) Prequential validation works like progressive validation (interleaved test-then-train). However, the new instances are more important than the old ones. This is implemented by a sliding window or a decay factor.
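The decay-factor variant of Definition 5.3 can be sketched as follows; the names `prequential_accuracy` and `Constant` are hypothetical, introduced only for illustration. Discounting both the number of correct predictions and the instance count by a fading factor \(\alpha < 1\) makes recent instances weigh more than old ones:

```python
def prequential_accuracy(stream, model, alpha=0.99):
    """Prequential (test-then-train) accuracy with a fading factor alpha < 1:
    at every step the old sums are discounted, so recent instances dominate."""
    s = b = 0.0          # discounted count of correct predictions / instances
    history = []
    for x, y in stream:
        correct = float(model.predict(x) == y)
        s = alpha * s + correct
        b = alpha * b + 1.0
        history.append(s / b)
        model.learn(x, y)
    return history

class Constant:
    """Toy classifier that always predicts the same class."""
    def __init__(self, c):
        self.c = c
    def predict(self, x):
        return self.c
    def learn(self, x, y):
        pass

# Ten instances with an abrupt label change after the fifth:
history = prequential_accuracy([(None, 1)] * 5 + [(None, 0)] * 5,
                               Constant(1), alpha=0.9)
```

After the change, the fading accuracy drops below the plain running accuracy (which would be 0.5 after ten instances), illustrating why prequential validation reacts faster to drift.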
5.1.2.3 Delayed Progressive Validation
Typically, an OML model calculates a prediction \(\hat{y}_t\) and then learns. This was referred to as “progressive validation” in Sect. 5.1.2. The prediction and the observed value can be compared to measure the correctness of the model. This approach is often used to evaluate OML models. In some cases, this approach is not appropriate, because the prediction and the ground truth are not available at the same time. In this case, it makes sense to delay the process until the ground truth is available. This is called delayed progressive validation.
Delayed Progressive Validation
While evaluating a machine learning model, the goal is to simulate production conditions to get a trustworthy assessment of the model’s performance. For example, consider the number of bicycles needed for a bike rental for the next week. Once 7 days have passed, the actual demand is known, and we can update the model. What we really want is to evaluate the model by, for example, forecasting seven days in advance and only updating the model when the ground truth is available (Grzenda et al., 2020).
Delayed progressive validation is of great importance in practice: Instead of updating the model immediately after it has made a prediction, the model is updated only when the ground truth is known. In this way, the evaluation more accurately reflects the real process.
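The following sketch (the function name `delayed_progressive_validation` is hypothetical) models the delay as a fixed number of instances; in practice the delay is often given by timestamps, as in the bike-rental example above. A queue holds each prediction until its ground truth becomes available, and only then is the metric updated and the model trained:

```python
from collections import deque

def delayed_progressive_validation(stream, model, delay):
    """Delayed test-then-train: the prediction made at time t is scored, and
    the model is updated, only `delay` steps later, when the ground truth
    becomes available."""
    pending = deque()                 # (x, y, y_hat) waiting for ground truth
    n, total = 0, 0.0
    for x, y in stream:
        pending.append((x, y, model.predict(x)))   # predict now, learn later
        if len(pending) > delay:                   # ground truth of the
            x0, y0, y_hat0 = pending.popleft()     # oldest prediction is known
            total += abs(y0 - y_hat0)
            n += 1
            model.learn(x0, y0)
    for x0, y0, y_hat0 in pending:                 # flush remaining predictions
        total += abs(y0 - y_hat0)
        n += 1
    return total / max(n, 1)

class RunningMean:
    """Toy regression model predicting the mean of the targets seen so far."""
    def __init__(self):
        self.n, self.s = 0, 0.0
    def predict(self, x):
        return self.s / self.n if self.n else 0.0
    def learn(self, x, y):
        self.n += 1
        self.s += y

mae = delayed_progressive_validation([(None, 2.0)] * 4, RunningMean(), delay=2)
```

With `delay=0` the sketch reduces to ordinary progressive validation; a larger delay means the model makes more predictions before it benefits from any feedback, which typically worsens the measured error.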
5.1.3 Machine Learning in Batch Mode with a Prediction Horizon
The method eval_bml_horizon implements the “classical” BML approach: The classical BML algorithm is trained once on the training data set, resulting in a model, say \(M^{(1)}_{\text {bml}}\), which is not changed: \(M_{\text {bml}}^{(1)} = M_{\text {bml}}\).
The model \(M_{\text {bml}}\) is evaluated on the test data, where the horizon, say \(h \in [1, s_{\text {test}}]\), comes into play: h specifies the size of the partitions into which \(D_{\text {test}}\) is divided. If \(h=s_{\text {test}}\), then the standard procedure of Machine Learning (ML) (“train-test”) is implemented. If \(h=1\), a pure OML-setting is simulated. The OML procedure is only simulated in this case, since the model \(M_{\text {bml}}\) is not updated or retrained. The BML approach is shown in Fig. 5.1.
If the entire test data set is used for the prediction horizon in the batch method, i.e., \(s_{\text {test}} = h\), then we obtain the classical holdout approach (see Sect. 5.1.1).
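The logic of `eval_bml_horizon` can be sketched as follows. The signature and the toy `MeanRegressor` model are hypothetical; the actual spotRiver function differs in detail (it also records memory and time). The model is fitted once and then scored on consecutive test partitions of size \(h\), without any retraining:

```python
class MeanRegressor:
    """Toy batch model: fit once, always predict the training mean."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
    def predict(self, X):
        return [self.mean] * len(X)

def eval_bml_horizon(model, X_train, y_train, X_test, y_test, horizon):
    """Batch evaluation with prediction horizon: fit once on the training
    data, then compute the MAE on each test partition of size `horizon`.
    The model is never updated again."""
    model.fit(X_train, y_train)
    maes = []
    for i in range(0, len(X_test), horizon):
        X_h, y_h = X_test[i:i + horizon], y_test[i:i + horizon]
        preds = model.predict(X_h)
        maes.append(sum(abs(p - t) for p, t in zip(preds, y_h)) / len(y_h))
    return maes

maes = eval_bml_horizon(MeanRegressor(), [0, 0], [1.0, 3.0],
                        [0, 0, 0, 0], [2.0, 2.0, 4.0, 0.0], horizon=2)
```

With `horizon=len(X_test)` this is exactly the classical train-test (holdout) evaluation; with `horizon=1` it simulates the per-instance evaluation of OML, but without the model updates.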
5.1.4 Landmark Batch Machine Learning with a Prediction Horizon
The method eval_bml_landmark implements a landmark approach. The first step is similar to the first step of the BML approach and \(M_{\text {bml}}^{(1)}\) is available. The following steps are different: After a prediction with \(M_{\text {bml}}^{(1)}\) for the batch of data instances from the interval \([s_{\text {train}}, s_{\text {train}} + h]\) has been calculated, the algorithm is retrained on the interval \([1, s_{\text {train}} + h]\) and an updated model \(M_{\text {bml}}^{(2)}\) is available. In the third step of the landmark BML, \(M_{\text {bml}}^{(2)}\) calculates predictions for \([s_{\text {train}} + h, s_{\text {train}} + 2 \times h]\) and a new model \(M_{\text {bml}}^{(3)}\) is trained on \([1, s_{\text {train}} + 2 \times h]\). The landmark approach is shown in Fig. 5.2.
5.1.5 Window-Batch Method with Prediction Horizon
The method eval_bml_window implements a window approach. Here, too, the first step is similar to the first step of the BML approach and \(M_{\text {bml}}^{(1)}\) is available. The following steps are similar to the landmark approach, with one important exception: The algorithm is not trained on the complete set of seen data. Instead, it is trained on a sliding window of size \(s_{\text {train}}\). The window batch approach is shown in Fig. 5.3.
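The landmark and window variants differ only in which data the model is refitted on, which a single sketch can show. The signature and the toy `MeanRegressor` are hypothetical; the spotRiver functions `eval_bml_landmark` and `eval_bml_window` differ in detail:

```python
class MeanRegressor:
    """Toy batch model: fit once per step, always predict the training mean."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
    def predict(self, X):
        return [self.mean] * len(X)

def eval_bml_incremental(model, X_train, y_train, X_test, y_test,
                         horizon, window=None):
    """Before each test partition of size `horizon`, refit the model on all
    data seen so far (landmark, window=None) or on the last `window`
    instances only (sliding window)."""
    X, y = list(X_train), list(y_train)
    maes = []
    for i in range(0, len(X_test), horizon):
        if window is None:
            model.fit(X, y)                      # landmark: all seen data
        else:
            model.fit(X[-window:], y[-window:])  # sliding window
        X_h, y_h = X_test[i:i + horizon], y_test[i:i + horizon]
        preds = model.predict(X_h)
        maes.append(sum(abs(p - t) for p, t in zip(preds, y_h)) / len(y_h))
        X.extend(X_h)                            # the batch joins the
        y.extend(y_h)                            # training data afterwards
    return maes

landmark = eval_bml_incremental(MeanRegressor(), [0, 0], [0.0, 0.0],
                                [0, 0], [4.0, 4.0], horizon=1)
window = eval_bml_incremental(MeanRegressor(), [0, 0], [0.0, 0.0],
                              [0, 0], [4.0, 4.0], horizon=1, window=1)
```

In this tiny example the target jumps from 0 to 4: the landmark model, anchored to all old data, adapts slowly, while the window model with `window=1` recovers immediately. This illustrates the trade-off between stability and adaptivity of the two variants.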
5.1.6 Online-Machine Learning with a Prediction Horizon
The method eval_oml_horizon implements an OML approach. This approach differs fundamentally from the batch approaches of ML, since each individual instance is used for prediction and training. If \(h=1\), a “pure” OML algorithm is implemented. If \(h>1\), the OML calculations are performed h times.
5.1.7 Online Machine Learning
The method eval_oml_iter_progressive is based on the method progressive_val_score from the package River.Footnote 1 The iterative procedure is shown in Fig. 5.4.
Table 5.1 provides a comparative overview of the selection methods.
5.2 Determining the Training and Test Data Set in the Package spotRiver
5.2.1 Methods for BML and OML
The BML algorithms require a training data set \(D_{\text {train}}\) of size \(s_{\text {train}}\) to adapt the model. The test data set \(D_{\text {test}}\) of size \(s_{\text {test}}\) is used to evaluate the model on new (unseen) data. For the comparative evaluation of BML and OML algorithms, the package Sequential Parameter Optimization Toolbox for River (spotRiver) provides five different methods.
The four evaluation functions shown in Table 5.2 accept two data frames as arguments: a training and a test data set. In the pure OML environment, the fifth evaluation function eval_oml_iter_progressive is used. This uses only one (test) data set, as it implements the progressive validation. The parameters are shown in Table 5.3.
Example for the Method eval_oml_horizon
The method plot_bml_oml_horizon_metrics visualizes (1) the error (e.g., MAE), (2) the memory consumption (MB), and (3) the calculation time (s) for different models of ML on a given data set. The function takes a list of Pandas data frames as input, each containing the metrics for one model. The parameters of the method plot_bml_oml_horizon_metrics are shown in Table 5.4. Figure 5.5 shows the output of the metrics and Fig. 5.6 shows the residuals, i.e., the difference between the current (actual) and the predicted values.
5.2.2 Methods for OML River
The methods presented so far (in Sect. 5.2.1) are equally suitable for evaluating BML and OML models for three different data splits (1. horizon, 2. landmark, and 3. window). In this section, the method eval_oml_iter_progressive is presented, which is specifically designed for the evaluation of OML models on a streaming data set. It is based on a method used in the River package, which makes it possible to compare the results with those of River. However, it cannot be used to evaluate BML models.
The method eval_oml_iter_progressive evaluates one or more OML models on a streaming data set. The evaluation is done iteratively, and the models are tested in each "step" of the iteration. The results are returned in the form of a dictionary with metrics and their values. Table 5.5 shows the parameters of the method eval_oml_iter_progressive.
The method plot_oml_iter_progressive visualizes the results based on the dictionary of evaluation results returned by eval_oml_iter_progressive. The visualization is based on the visualization in River.Footnote 2 Figure 5.7 shows the output.
Notebook: Progressive Validation
An example of progressive validation in the GitHub repository https://github.com/sn-code-inside/online-machine-learning shows how delayed progressive validation can be applied using the moment and delay parameters of the method progressive_val_score. It exploits the fact that each observation in the data stream is shown to the model twice: once to make a prediction, and once to update the model when the true value is revealed.
The moment parameter determines which variable to use as a timestamp, while the delay parameter controls the waiting time before the true values are revealed to the model.
Tip
Further information on progressive validation can be found in the documentation of the River package.
In addition, Grzenda et al. (2020) is worth mentioning, which deals with delayed, progressive validation.
5.3 Algorithm (Model) Performance
After the training and test data selection has been performed, the performance of the algorithm (or model) can be estimated. For this purpose, numerous metrics are available. Table 5.6 presents a selection of the metrics available in the package River. The selection of a suitable metric is crucial for the analysis of OML algorithms. For classification tasks, for example, accuracy is only a suitable metric if balanced classes are present. Kappa statistics (see Sect. A.4) are better suited for OML. Thomas and Uminsky (2022) give hints for the selection of suitable metrics.
The computation of the memory consumption is simple only at first glance. Programming languages such as Python or R perform memory management routines autonomously, outside the user's control. For example, the garbage collector is not necessarily executed immediately after a call, since the runtime uses its own memory optimization routines, and from its point of view it is sometimes more advantageous not to release the memory. There are also many dependencies between individual objects, so they cannot simply be deleted even if this is desirable from the user's point of view. These remarks apply equally to BML and OML methods. According to our research (exchange with R experts), estimating memory consumption in the programming language R is more difficult than in Python. This was one of the reasons why the studies presented in Chaps. 9 and 10 were carried out with Python. The module tracemalloc, introduced in Python 3.4, was used.
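A minimal usage sketch of tracemalloc (the helper name `peak_memory` is hypothetical): the module is started before the measured call and the peak of traced allocations is read afterwards. Note that tracemalloc only tracks memory allocated through Python's own allocator, so memory allocated by C extensions may be missed:

```python
import tracemalloc

def peak_memory(func, *args, **kwargs):
    """Run `func` and return its result together with the peak memory (in
    bytes) allocated by Python objects during the call, measured with the
    standard-library module tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# Example: peak memory while building a list of one million integers
data, peak_bytes = peak_memory(lambda: list(range(1_000_000)))
```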
5.4 Data Stream and Drift Generators
Most software packages provide functions for generating synthetic data streams ("data-stream generators"). As an example, we list the generators available in the package scikit-multiflow in Sect. 5.4.1. We also describe the SEA synthetic dataset (SEA) and the Friedman-Drift generators, which are used in many OML publications that examine drift.
5.4.1 Data Stream Generators in scikit-multiflow
For example, the package scikit-multiflow provides the following data stream generators:
- Sine generator and anomaly sine generator
- Mixed data stream generator
- Random Radial Basis Function stream generator (with and without concept drift)
- Waveform stream generator
- Regression generator.
Tip
By sorting the observations, concept drift can be simulated (Bifet & Gavaldà, 2009).
5.4.2 SEA-Drift Generator
The SEA is a frequently cited data set. Its generator implements the data stream with abrupt drift as described in Street and Kim (2001). Each observation consists of three features. Only the first two features are relevant. The target variable is binary and positive (true) if the sum of the features exceeds a certain threshold. There are four threshold values to choose from. Concept drift can be introduced at any time during the stream by switching the threshold.
In detail, the SEA data set is generated as follows: First, \(n=60{,}000\) random points are generated in a three-dimensional feature space. The features have values between 0 and 10, with only the first two features (\(f_1\) and \(f_2\)) being relevant. The n points are then divided into four blocks of 15,000 points each. In each block, the class membership of a point is determined by means of a threshold value \(\tau _i\), where i indicates the respective block. The threshold values \(\tau _1= 8\), \(\tau _2=9\), \(\tau _3=7\) and \(\tau _4 = 9.5\) are chosen. In addition, the data is noisy (“We inserted about 10% class noise into each block of data.”) by swapping 10% of the class memberships. Finally, a test set (\(n_t = 10{,}000\)) is determined, consisting of 2,500 data points from each block.
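The generation procedure described above can be sketched directly (the function name `make_sea` is hypothetical; River's SEA generator differs in detail, e.g., it yields a stream rather than a list):

```python
import random

def make_sea(n=60_000, noise=0.1, seed=1):
    """Generate SEA data as described above: three features in [0, 10],
    four equally sized blocks with thresholds tau = (8, 9, 7, 9.5) on
    f1 + f2, and about `noise` of the labels flipped in each block."""
    rng = random.Random(seed)
    thresholds = (8.0, 9.0, 7.0, 9.5)
    data = []
    for tau in thresholds:                       # one block per threshold
        for _ in range(n // 4):
            f1, f2, f3 = (rng.uniform(0, 10) for _ in range(3))  # f3 irrelevant
            label = (f1 + f2) > tau
            if rng.random() < noise:             # class noise: flip the label
                label = not label
            data.append(((f1, f2, f3), label))
    return data

sea = make_sea(n=4000)
```

Because each block uses a different threshold, the concept changes abruptly at the block boundaries; a model that cannot adapt will show a jump in its error there.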
The Python package River provides the function SEA to generate the data. Figure 5.8 shows an instantiation of the SEA drift data.
5.4.3 Friedman-Drift Generator
The Friedman-Drift generator introduced in Definition 1.8 is another generator that is frequently cited in the literature (Ikonomovska, 2012). It generates a data stream that simulates the characteristics of streaming data that occur in practice. The generator is implemented in River as FriedmanDrift and is used in Sect. 9.2.
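The idea behind a Friedman stream with drift can be sketched as follows. The Friedman regression function itself is standard; the drift mechanism shown here (permuting which features are relevant at a fixed change point) is an illustrative assumption, and River's FriedmanDrift implements several such variants with its own parameters:

```python
import math
import random

def friedman(z):
    """The Friedman regression function on five features in [0, 1]."""
    return (10 * math.sin(math.pi * z[0] * z[1])
            + 20 * (z[2] - 0.5) ** 2 + 10 * z[3] + 5 * z[4])

def friedman_drift_stream(n, drift_at, seed=1, noise=1.0):
    """Sketch of a Friedman stream with one abrupt drift: after `drift_at`
    instances, a different subset of the ten features becomes relevant, so
    the concept changes while the feature distribution stays the same."""
    rng = random.Random(seed)
    for t in range(n):
        x = [rng.random() for _ in range(10)]
        z = x[0:5] if t < drift_at else x[2:7]  # relevant features switch
        yield x, friedman(z) + rng.gauss(0, noise)

stream = list(friedman_drift_stream(100, drift_at=50))
```

Since only the mapping from features to target changes, this simulates pure concept drift, the hardest case for drift detectors that monitor the input distribution alone.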
5.5 Summary
The interleaved test-then-train (or prequential evaluation) is a general method for evaluating learning algorithms in streaming scenarios. Interleaved test-then-train opens up interesting possibilities: The system is able to monitor the development of the learning process itself and to diagnose its development itself. The delayed progressive evaluation is the subject of current research and enables a realistic analysis of complex changes in online data streams. In addition to quality, however, other criteria/metrics must be taken into account, which are imposed by data stream properties. The available memory is one of the most important constraints. Another aspect is time, because algorithms must process the examples as quickly as (if not faster than) they arrive.
Note
The experimental studies in Chap. 9 use the following three properties for the comparison of BML and OML methods:
1. performance,
2. memory consumption, and
3. time consumption.
Notes
1.
2. See "Incremental decision trees in River: the Hoeffding Tree case", https://riverml.xyz/0.15.0/recipes/on-hoeffding-trees/.
References
Bifet, A., & Gavaldà, R. (2009). Adaptive learning from evolving data streams. In Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII, IDA’09 (pp. 249–260). Springer.
Grzenda, M., Gomes, H. M., & Bifet, A. (2020). Delayed labelling evaluation for data streams. Data Mining and Knowledge Discovery, 34(5), 1237–1266.
Ikonomovska, E. (2012). Algorithms for learning regression trees and ensembles on evolving data streams. Ph.D. Thesis, Jozef Stefan International Postgraduate School.
Street, W. N., & Kim, Y. S. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’01 (pp. 377–382). Association for Computing Machinery.
Thomas, R. L., & Uminsky, D. (2022). Reliance on metrics is a fundamental challenge for AI. Patterns, 3(5), 1–8.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Bartz-Beielstein, T. (2024). Evaluation and Performance Measurement. In: Bartz, E., Bartz-Beielstein, T. (eds) Online Machine Learning. Machine Learning: Foundations, Methodologies, and Applications. Springer, Singapore. https://doi.org/10.1007/978-981-99-7007-0_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7006-3
Online ISBN: 978-981-99-7007-0