Abstract
This chapter discusses aspects to be considered when evaluating Online Machine Learning (OML) algorithms, especially when comparing them to Batch Machine Learning (BML) algorithms. The following considerations play an important role:
1. How are training and test data selected?
2. How can performance be measured?
3. What procedures are available for generating benchmark data sets?
Section 5.1 describes the selection of training and test data. Section 5.2 presents an implementation in Python for selecting training and test data. Section 5.3 describes the calculation of performance. Section 5.4 introduces the generation of benchmark data sets in the field of OML.
5.1 Data Selection Methods
The differences between BML and OML are greatest in the choice of the data selection method and in the calculation of the performance. Among other things, resources (memory and time, but not the data) are severely limited in OML. In addition, Cross Validation (CV) is not possible. It is therefore very important to determine which instances are used for training and which for testing (and possibly also for validation).
For each of the selection approaches presented in the following, a metric must be selected, e.g., accuracy or Mean Absolute Error (MAE).
5.1.1 Holdout Selection
In the holdout evaluation method, the performance of the model is evaluated on a test data set consisting of examples that the model has not yet seen. These examples are used only for evaluation and not for training the model.
Definition 5.1
(Holdout) In the holdout evaluation method, the performance is evaluated after each batch, i.e., after a certain number of examples or observations. For this purpose, two parameters must be defined:
1. the size of the (holdout) window and
2. the frequency of testing.
The holdout evaluation is best when current and representative holdout data are used.
Why are holdout data not always used for OML? It is not always easy, or even possible, to obtain such data. In addition, the holdout data set must be representative, which cannot be guaranteed with streaming data due to possible changes: the holdout data of today may already be outdated tomorrow. Conversely, if the period in which the holdout data are collected is too short, these data may fail to capture essential relationships.
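The periodic holdout scheme of Definition 5.1 can be sketched as follows. The function `holdout_eval` and the toy `MajorityClass` model are hypothetical names introduced only for illustration; they are not part of any package. Every `test_every` training instances, the next `window_size` instances are diverted into a holdout window that is used only for testing, never for training:

```python
class MajorityClass:
    """A minimal classifier, used only to make the sketch runnable."""
    def __init__(self):
        self.counts = {}
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

def holdout_eval(stream, model, window_size=10, test_every=100):
    """Periodic holdout: every `test_every` training instances, collect the
    next `window_size` instances as holdout data (testing only, no training)
    and record the accuracy on that window."""
    scores, trained, holdout, collecting = [], 0, [], False
    for x, y in stream:
        if collecting:
            holdout.append((x, y))
            if len(holdout) == window_size:
                correct = sum(model.predict(xh) == yh for xh, yh in holdout)
                scores.append(correct / window_size)
                holdout, collecting = [], False
        else:
            model.learn(x, y)
            trained += 1
            if trained % test_every == 0:
                collecting = True
    return scores

scores = holdout_eval(((None, 1) for _ in range(1000)), MajorityClass())
```

Note the trade-off mentioned above: the holdout window must be large enough to be representative, yet recent enough not to be outdated.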
5.1.2 Progressive Validation: Interleaved Test-Then-Train
In statistics, progressive validation is generally understood to be the validation over a longer period of time, e.g., by using control charts. In the streaming data context, the term is used for approaches in which the individual instances are first used for testing (determining the quality of the model, the model calculates a prediction) and then for learning (training the model). Each individual instance is analyzed according to its arrival order. In addition to simple progressive validation, we also consider prequential validation and delayed progressive validation.
5.1.2.1 Progressive Validation
Definition 5.2
(Progressive Validation) Each observation can be denoted as \((X_t, y_t)\), where \(X_t\) is a set of features, \(y_t\) is a label (or a prediction value), and t denotes the time (or simply the index). Before updating the model with the pair \((X_t, y_t)\), the model calculates a prediction \(\hat{y}_t\) for \(X_t\). Using the ground truth \(y_t\) and the predicted value \(\hat{y}_t\), the online metric is then updated. Common metrics such as accuracy, MAE, Mean Squared Error (MSE), and the Area Under the Receiver Operating Characteristic Curve (ROC AUC) can all be expressed as running sums and can therefore be updated online.
This procedure can also be used for time series: If there are t observations \((x_1, x_2, \ldots , x_t)\), then the values \((x_{t-k}, x_{t-k+1}, \ldots , x_{t-1})\) can be used as \(X_t\) and the value \(x_t\) as \(y_t\). Alternatively, additional features can be calculated from the values \((x_{t-k}, x_{t-k+1}, \ldots , x_{t-1})\), which are then used as \(X_t\). Typical features are the information about the day of the week or the season.
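The test-then-train loop of Definition 5.2 can be sketched in a few lines. The names `progressive_validation` and `RunningMean` are hypothetical, introduced only for illustration; the running MAE stands in for any online-updatable metric:

```python
class RunningMean:
    """Toy regression model: always predicts the mean of the targets seen so far."""
    def __init__(self):
        self.n, self.s = 0, 0.0
    def predict(self, x):
        return self.s / self.n if self.n else 0.0
    def learn(self, x, y):
        self.n += 1
        self.s += y

def progressive_validation(stream, model):
    """Interleaved test-then-train: for every (X_t, y_t), first predict
    y_hat_t, update the online metric (here: MAE), then learn from y_t."""
    n, total = 0, 0.0
    for x_t, y_t in stream:
        y_hat = model.predict(x_t)   # test first ...
        total += abs(y_t - y_hat)    # ... update the running MAE ...
        n += 1
        model.learn(x_t, y_t)        # ... then train
    return total / max(n, 1)

mae = progressive_validation([(None, 2.0), (None, 2.0), (None, 2.0)], RunningMean())
```

Because each instance is predicted before the model has seen its label, the resulting metric is an honest estimate of out-of-sample performance.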
5.1.2.2 Prequential Validation
Definition 5.3
(Prequential Validation) Prequential validation works like progressive validation (interleaved test-then-train). However, the new instances are more important than the old ones. This is implemented by a sliding window or a decay factor.
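The decay-factor variant of Definition 5.3 can be sketched as follows; the names `prequential_accuracy` and `Constant` are hypothetical, introduced only for illustration. Discounting both the number of correct predictions and the instance count by a fading factor \(\alpha < 1\) makes recent instances weigh more than old ones:

```python
def prequential_accuracy(stream, model, alpha=0.99):
    """Prequential (test-then-train) accuracy with a fading factor alpha < 1:
    at every step the old sums are discounted, so recent instances dominate."""
    s = b = 0.0          # discounted count of correct predictions / instances
    history = []
    for x, y in stream:
        correct = float(model.predict(x) == y)
        s = alpha * s + correct
        b = alpha * b + 1.0
        history.append(s / b)
        model.learn(x, y)
    return history

class Constant:
    """Toy classifier that always predicts the same class."""
    def __init__(self, c):
        self.c = c
    def predict(self, x):
        return self.c
    def learn(self, x, y):
        pass

# Ten instances with an abrupt label change after the fifth:
history = prequential_accuracy([(None, 1)] * 5 + [(None, 0)] * 5,
                               Constant(1), alpha=0.9)
```

After the change, the fading accuracy drops below the plain running accuracy (which would be 0.5 after ten instances), illustrating why prequential validation reacts faster to drift.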
5.1.2.3 Delayed Progressive Validation
Typically, an OML model calculates a prediction \(\hat{y}_t\) and then learns. This was referred to as “progressive validation” in Sect. 5.1.2. The prediction and the observed value can be compared to measure the correctness of the model. This approach is often used to evaluate OML models. In some cases, this approach is not appropriate, because the prediction and the ground truth are not available at the same time. In this case, it makes sense to delay the process until the ground truth is available. This is called delayed progressive validation.
Delayed Progressive Validation
While evaluating a machine learning model, the goal is to simulate production conditions to get a trustworthy assessment of the model’s performance. For example, consider the number of bicycles needed for a bike rental for the next week. Once 7 days have passed, the actual demand is known, and we can update the model. What we really want is to evaluate the model by, for example, forecasting seven days in advance and only updating the model when the ground truth is available (Grzenda et al., 2020).
Delayed progressive validation is of great importance in practice: Instead of updating the model immediately after it has made a prediction, the model is updated only when the ground truth is known. In this way, the evaluation more accurately reflects the real process.
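The following sketch (the function name `delayed_progressive_validation` is hypothetical) models the delay as a fixed number of instances; in practice the delay is often given by timestamps, as in the bike-rental example above. A queue holds each prediction until its ground truth becomes available, and only then is the metric updated and the model trained:

```python
from collections import deque

def delayed_progressive_validation(stream, model, delay):
    """Delayed test-then-train: the prediction made at time t is scored, and
    the model is updated, only `delay` steps later, when the ground truth
    becomes available."""
    pending = deque()                 # (x, y, y_hat) waiting for ground truth
    n, total = 0, 0.0
    for x, y in stream:
        pending.append((x, y, model.predict(x)))   # predict now, learn later
        if len(pending) > delay:                   # ground truth of the
            x0, y0, y_hat0 = pending.popleft()     # oldest prediction is known
            total += abs(y0 - y_hat0)
            n += 1
            model.learn(x0, y0)
    for x0, y0, y_hat0 in pending:                 # flush remaining predictions
        total += abs(y0 - y_hat0)
        n += 1
    return total / max(n, 1)

class RunningMean:
    """Toy regression model predicting the mean of the targets seen so far."""
    def __init__(self):
        self.n, self.s = 0, 0.0
    def predict(self, x):
        return self.s / self.n if self.n else 0.0
    def learn(self, x, y):
        self.n += 1
        self.s += y

mae = delayed_progressive_validation([(None, 2.0)] * 4, RunningMean(), delay=2)
```

With `delay=0` the sketch reduces to ordinary progressive validation; a larger delay means the model makes more predictions before it benefits from any feedback, which typically worsens the measured error.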
5.1.3 Machine Learning in Batch Mode with a Prediction Horizon
The method eval_bml_horizon implements the “classical” BML approach: The classical BML algorithm is trained once on the training data set, resulting in a model, say \(M^{(1)}_{\text {bml}}\), which is not changed: \(M_{\text {bml}}^{(1)} = M_{\text {bml}}\).
The model \(M_{\text {bml}}\) is evaluated on the test data, where the horizon, say \(h \in [1, s_{\text {test}}]\), comes into play: h specifies the size of the partitions into which \(D_{\text {test}}\) is divided. If \(h=s_{\text {test}}\), then the standard procedure of Machine Learning (ML) (“train-test”) is implemented. If \(h=1\), a pure OML-setting is simulated. The OML procedure is only simulated in this case, since the model \(M_{\text {bml}}\) is not updated or retrained. The BML approach is shown in Fig. 5.1.
If the entire test data set is used for the prediction horizon in the batch method, i.e., \(s_{\text {test}} = h\), then we obtain the classical holdout approach (see Sect. 5.1.1).
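The logic of `eval_bml_horizon` can be sketched as follows. The signature and the toy `MeanRegressor` model are hypothetical; the actual spotRiver function differs in detail (it also records memory and time). The model is fitted once and then scored on consecutive test partitions of size \(h\), without any retraining:

```python
class MeanRegressor:
    """Toy batch model: fit once, always predict the training mean."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
    def predict(self, X):
        return [self.mean] * len(X)

def eval_bml_horizon(model, X_train, y_train, X_test, y_test, horizon):
    """Batch evaluation with prediction horizon: fit once on the training
    data, then compute the MAE on each test partition of size `horizon`.
    The model is never updated again."""
    model.fit(X_train, y_train)
    maes = []
    for i in range(0, len(X_test), horizon):
        X_h, y_h = X_test[i:i + horizon], y_test[i:i + horizon]
        preds = model.predict(X_h)
        maes.append(sum(abs(p - t) for p, t in zip(preds, y_h)) / len(y_h))
    return maes

maes = eval_bml_horizon(MeanRegressor(), [0, 0], [1.0, 3.0],
                        [0, 0, 0, 0], [2.0, 2.0, 4.0, 0.0], horizon=2)
```

With `horizon=len(X_test)` this is exactly the classical train-test (holdout) evaluation; with `horizon=1` it simulates the per-instance evaluation of OML, but without the model updates.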
5.1.4 Landmark Batch Machine Learning with a Prediction Horizon
The method eval_bml_landmark implements a landmark approach. The first step is similar to the first step of the BML approach and \(M_{\text {bml}}^{(1)}\) is available. The following steps are different: After a prediction with \(M_{\text {bml}}^{(1)}\) for the batch of data instances from the interval \([s_{\text {train}}, s_{\text {train}} + h]\) has been calculated, the algorithm is retrained on the interval \([1, s_{\text {train}} + h]\) and an updated model \(M_{\text {bml}}^{(2)}\) is available. In the third step of the landmark BML, \(M_{\text {bml}}^{(2)}\) calculates predictions for \([s_{\text {train}} + h, s_{\text {train}} + 2 \times h]\) and a new model \(M_{\text {bml}}^{(3)}\) is trained on \([1, s_{\text {train}} + 2 \times h]\). The landmark approach is shown in Fig. 5.2.
5.1.5 Window-Batch Method with Prediction Horizon
The method eval_bml_window implements a window approach. Here, too, the first step is similar to the first step of the BML approach and \(M_{\text {bml}}^{(1)}\) is available. The following steps are similar to the landmark approach, with one important exception: The algorithm is not trained on the complete set of seen data. Instead, it is trained on a sliding window of size \(s_{\text {train}}\). The window batch approach is shown in Fig. 5.3.
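The landmark and window variants differ only in which data the model is refitted on, which a single sketch can show. The signature and the toy `MeanRegressor` are hypothetical; the spotRiver functions `eval_bml_landmark` and `eval_bml_window` differ in detail:

```python
class MeanRegressor:
    """Toy batch model: fit once per step, always predict the training mean."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
    def predict(self, X):
        return [self.mean] * len(X)

def eval_bml_incremental(model, X_train, y_train, X_test, y_test,
                         horizon, window=None):
    """Before each test partition of size `horizon`, refit the model on all
    data seen so far (landmark, window=None) or on the last `window`
    instances only (sliding window)."""
    X, y = list(X_train), list(y_train)
    maes = []
    for i in range(0, len(X_test), horizon):
        if window is None:
            model.fit(X, y)                      # landmark: all seen data
        else:
            model.fit(X[-window:], y[-window:])  # sliding window
        X_h, y_h = X_test[i:i + horizon], y_test[i:i + horizon]
        preds = model.predict(X_h)
        maes.append(sum(abs(p - t) for p, t in zip(preds, y_h)) / len(y_h))
        X.extend(X_h)                            # the batch joins the
        y.extend(y_h)                            # training data afterwards
    return maes

landmark = eval_bml_incremental(MeanRegressor(), [0, 0], [0.0, 0.0],
                                [0, 0], [4.0, 4.0], horizon=1)
window = eval_bml_incremental(MeanRegressor(), [0, 0], [0.0, 0.0],
                              [0, 0], [4.0, 4.0], horizon=1, window=1)
```

In this tiny example the target jumps from 0 to 4: the landmark model, anchored to all old data, adapts slowly, while the window model with `window=1` recovers immediately. This illustrates the trade-off between stability and adaptivity of the two variants.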
5.1.6 Online-Machine Learning with a Prediction Horizon
The method eval_oml_horizon implements an OML approach. This approach differs fundamentally from the batch approaches of ML, since each individual instance is used for prediction and training. If \(h=1\), a “pure” OML algorithm is implemented. If \(h>1\), the OML calculations are performed h times.
5.1.7 Online Machine Learning
The method eval_oml_iter_progressive is based on the method progressive_val_score from the package River.Footnote 1 The iterative procedure is shown in Fig. 5.4.
Table 5.1 provides a comparative overview of the selection methods.
5.2 Determining the Training and Test Data Set in the Package spotRiver
5.2.1 Methods for BML and OML
The BML algorithms require a training data set \(D_{\text {train}}\) of size \(s_{\text {train}}\) to adapt the model. The test data set \(D_{\text {test}}\) of size \(s_{\text {test}}\) is used to evaluate the model on new (unseen) data. For the comparative evaluation of BML and OML algorithms, the package Sequential Parameter Optimization Toolbox for River (spotRiver) provides five different methods.
The four evaluation functions shown in Table 5.2 accept two data frames as arguments: a training and a test data set. In the pure OML environment, the fifth evaluation function eval_oml_iter_progressive is used. This uses only one (test) data set, as it implements the progressive validation. The parameters are shown in Table 5.3.
Example for the Method eval_oml_horizon
The method plot_bml_oml_horizon_metrics visualizes (1) the error (e.g., MAE), (2) the memory consumption (MB), and (3) the calculation time (s) for different models of ML on a given data set. The function takes a list of Pandas data frames as input, each containing the metrics for one model. The parameters of the method plot_bml_oml_horizon_metrics are shown in Table 5.4. Figure 5.5 shows the output of the metrics and Fig. 5.6 shows the residuals, i.e., the difference between the current (actual) and the predicted values.
5.2.2 Methods for OML River
The methods presented so far (in Sect. 5.2.1) are equally suitable for evaluating BML and OML models for three different data splits (1. horizon, 2. landmark, and 3. window). In this section, the method eval_oml_iter_progressive is presented, which is specifically designed for the evaluation of OML models on a streaming data set. It is based on a method used in the River package, which makes it possible to compare the results with those of River. However, it cannot be used to evaluate BML models.
The method eval_oml_iter_progressive evaluates one or more OML models on a streaming data set. The evaluation is done iteratively, and the models are tested in each "step" of the iteration. The results are returned in the form of a dictionary with metrics and their values. Table 5.5 shows the parameters of the method eval_oml_iter_progressive.
The method plot_oml_iter_progressive visualizes the results based on the dictionary of evaluation results returned by eval_oml_iter_progressive. The visualization is based on the visualization in River.Footnote 2 Figure 5.7 shows the output.
Notebook: Progressive Validation
An example of progressive validation in the GitHub repository https://github.com/sn-code-inside/online-machine-learning shows how delayed progressive validation can be applied using the moment and delay parameters of the method progressive_val_score. It exploits the fact that each observation in the data stream is shown to the model twice: once to make a prediction, and once to update the model when the true value is revealed.
The moment parameter determines which variable to use as a timestamp, while the delay parameter controls the waiting time before the true values are revealed to the model.
Tip
Further information on progressive validation can be found in the documentation of the River package.
In addition, Grzenda et al. (2020) is worth mentioning, which deals with delayed, progressive validation.
5.3 Algorithm (Model) Performance
After the training and test data selection has been performed, the performance of the algorithm (or model) can be estimated. For this purpose, numerous metrics are available. Table 5.6 presents a selection of the metrics available in the package River. The selection of a suitable metric is crucial for the analysis of OML algorithms. For classification tasks, for example, accuracy is only a suitable metric if balanced classes are present. Kappa statistics (see Sect. A.4) are better suited for OML. Thomas and Uminsky (2022) give hints for the selection of suitable metrics.
The computation of the memory consumption is simple only at first glance. Programming languages such as Python or R perform memory management routines autonomously, outside the user's control. For example, the garbage collector is not necessarily executed immediately after a call, since the runtime uses its own memory optimization routines, and from its point of view it is sometimes more advantageous not to release the memory. There are also many dependencies between individual objects, so they cannot simply be deleted even if this is desirable from the user's point of view. These remarks apply equally to BML and OML methods. According to our research (exchange with R experts), estimating memory consumption in the programming language R is more difficult than in Python. This was one of the reasons why the studies presented in Chaps. 9 and 10 were carried out with Python. The module tracemalloc, introduced in Python 3.4, was used.
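A minimal usage sketch of tracemalloc (the helper name `peak_memory` is hypothetical): the module is started before the measured call and the peak of traced allocations is read afterwards. Note that tracemalloc only tracks memory allocated through Python's own allocator, so memory allocated by C extensions may be missed:

```python
import tracemalloc

def peak_memory(func, *args, **kwargs):
    """Run `func` and return its result together with the peak memory (in
    bytes) allocated by Python objects during the call, measured with the
    standard-library module tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# Example: peak memory while building a list of one million integers
data, peak_bytes = peak_memory(lambda: list(range(1_000_000)))
```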
5.4 Data Stream and Drift Generators
Most software packages provide functions for generating synthetic data streams ("data-stream generators"). As an example, we list the generators available in the package scikit-multiflow in Sect. 5.4.1. We also describe the SEA synthetic dataset (SEA) and the Friedman-Drift generators, which are used in many OML publications that examine drift.
5.4.1 Data Stream Generators in scikit-multiflow
For example, the package scikit-multiflow provides the following data stream generators:
- Sine generator and anomaly sine generator
- Mixed data stream generator
- Random Radial Basis Function stream generator (with and without concept drift)
- Waveform stream generator
- Regression generator.
Tip
By sorting the observations, concept drift can be simulated (Bifet & Gavaldà, 2009).
5.4.2 SEA-Drift Generator
The SEA is a frequently cited data set. Its generator implements the data stream with abrupt drift as described in Street and Kim (2001). Each observation consists of three features. Only the first two features are relevant. The target variable is binary and positive (true) if the sum of the features exceeds a certain threshold. There are four threshold values to choose from. Concept drift can be introduced at any time during the stream by switching the threshold.
In detail, the SEA data set is generated as follows: First, \(n=60{,}000\) random points are generated in a three-dimensional feature space. The features have values between 0 and 10, with only the first two features (\(f_1\) and \(f_2\)) being relevant. The n points are then divided into four blocks of 15,000 points each. In each block, the class membership of a point is determined by means of a threshold value \(\tau _i\), where i indicates the respective block. The threshold values \(\tau _1= 8\), \(\tau _2=9\), \(\tau _3=7\) and \(\tau _4 = 9.5\) are chosen. In addition, the data is noisy (“We inserted about 10% class noise into each block of data.”) by swapping 10% of the class memberships. Finally, a test set (\(n_t = 10{,}000\)) is determined, consisting of 2,500 data points from each block.
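The generation procedure described above can be sketched directly (the function name `make_sea` is hypothetical; River's SEA generator differs in detail, e.g., it yields a stream rather than a list):

```python
import random

def make_sea(n=60_000, noise=0.1, seed=1):
    """Generate SEA data as described above: three features in [0, 10],
    four equally sized blocks with thresholds tau = (8, 9, 7, 9.5) on
    f1 + f2, and about `noise` of the labels flipped in each block."""
    rng = random.Random(seed)
    thresholds = (8.0, 9.0, 7.0, 9.5)
    data = []
    for tau in thresholds:                       # one block per threshold
        for _ in range(n // 4):
            f1, f2, f3 = (rng.uniform(0, 10) for _ in range(3))  # f3 irrelevant
            label = (f1 + f2) > tau
            if rng.random() < noise:             # class noise: flip the label
                label = not label
            data.append(((f1, f2, f3), label))
    return data

sea = make_sea(n=4000)
```

Because each block uses a different threshold, the concept changes abruptly at the block boundaries; a model that cannot adapt will show a jump in its error there.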
The Python package River provides the function SEA to generate the data. Figure 5.8 shows an instantiation of the SEA drift data.
5.4.3 Friedman-Drift Generator
The Friedman-Drift generator introduced in Definition 1.8 is another generator that is frequently cited in the literature (Ikonomovska, 2012). It generates a data stream that simulates the characteristics of streaming data that occur in practice. The generator is implemented in River as FriedmanDrift and is used in Sect. 9.2.
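The idea behind a Friedman stream with drift can be sketched as follows. The Friedman regression function itself is standard; the drift mechanism shown here (permuting which features are relevant at a fixed change point) is an illustrative assumption, and River's FriedmanDrift implements several such variants with its own parameters:

```python
import math
import random

def friedman(z):
    """The Friedman regression function on five features in [0, 1]."""
    return (10 * math.sin(math.pi * z[0] * z[1])
            + 20 * (z[2] - 0.5) ** 2 + 10 * z[3] + 5 * z[4])

def friedman_drift_stream(n, drift_at, seed=1, noise=1.0):
    """Sketch of a Friedman stream with one abrupt drift: after `drift_at`
    instances, a different subset of the ten features becomes relevant, so
    the concept changes while the feature distribution stays the same."""
    rng = random.Random(seed)
    for t in range(n):
        x = [rng.random() for _ in range(10)]
        z = x[0:5] if t < drift_at else x[2:7]  # relevant features switch
        yield x, friedman(z) + rng.gauss(0, noise)

stream = list(friedman_drift_stream(100, drift_at=50))
```

Since only the mapping from features to target changes, this simulates pure concept drift, the hardest case for drift detectors that monitor the input distribution alone.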
5.5 Summary
The interleaved test-then-train (or prequential evaluation) is a general method for evaluating learning algorithms in streaming scenarios. Interleaved test-then-train opens up interesting possibilities: The system is able to monitor the development of the learning process itself and to diagnose its development itself. The delayed progressive evaluation is the subject of current research and enables a realistic analysis of complex changes in online data streams. In addition to quality, however, other criteria/metrics must be taken into account, which are imposed by data stream properties. The available memory is one of the most important constraints. Another aspect is time, because algorithms must process the examples as quickly as (if not faster than) they arrive.
Note
The experimental studies in Chap. 9 use the following three properties for the comparison of BML and OML methods:
1. performance,
2. memory consumption, and
3. time consumption.
Notes
1.
2. See "Incremental decision trees in River: the Hoeffding Tree case", https://riverml.xyz/0.15.0/recipes/on-hoeffding-trees/.
References
Bifet, A., & Gavaldà, R. (2009). Adaptive learning from evolving data streams. In Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII, IDA’09 (pp. 249–260). Springer.
Grzenda, M., Gomes, H. M., & Bifet, A. (2020). Delayed labelling evaluation for data streams. Data Mining and Knowledge Discovery, 34(5), 1237–1266.
Ikonomovska, E. (2012). Algorithms for learning regression trees and ensembles on evolving data streams. Ph.D. Thesis, Jozef Stefan International Postgraduate School.
Street, W. N., & Kim, Y. S. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’01 (pp. 377–382). Association for Computing Machinery.
Thomas, R. L., & Uminsky, D. (2022). Reliance on metrics is a fundamental challenge for AI. Patterns, 3(5), 1–8.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Bartz-Beielstein, T. (2024). Evaluation and Performance Measurement. In: Bartz, E., Bartz-Beielstein, T. (eds) Online Machine Learning. Machine Learning: Foundations, Methodologies, and Applications. Springer, Singapore. https://doi.org/10.1007/978-981-99-7007-0_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7006-3
Online ISBN: 978-981-99-7007-0