
1 Introduction

Analytical techniques used in laboratories are frequently insufficient since they necessitate a large number of samples, a long time to obtain results, and highly trained personnel (Zou & Zhao, 2015). In an environment where speed is critical, engineering advances must require fewer samples or, ideally, none at all (nondestructive techniques): (a) they must provide prompt, if not immediate, responses so that the operator can make an informed decision on the next steps to regulate or release the product to the market; (b) they must be simple to use in order to encourage their adoption across the manufacturing chain, where analytical laboratories are not always available. As a result, technology needs to be adapted to a new production strategy: the use of sensors and the associated information extraction systems, which enable “measurement,” to meet the demands of agri-food stakeholders. Moreover, the manufacturers of these technologies often provide devices that require calibration phases that are not always easy to perform and that remain the subject of ongoing research. These phases are particularly complex when similar processes need to run repeatedly (Eriksson et al., 2013).

Hence, chemometric techniques in nondestructive quality evaluation aim to produce an empirical or semi-empirical model from data that may be used to predict one or more chemical properties of a system from observations (Cocchi, 2017). Chemometrics employs mathematical and statistical methodologies to optimize experimental processes through the scientific design of experiments, to treat the experimental data, and to extract as much relevant chemical information as possible from the generated data (Guidetti et al., 2012). Chemical systems are often multivariate, which means that many pieces of information are obtained at the same time. As a result, the majority of chemometric procedures fall within the category of analytical methods known as multivariate statistical analysis, which involves many measurements on a number of individuals, objects, or data samples. Multiple measurements and the analysis of dependence among variables are therefore central to chemometrics (Marini, 2013). This chapter aims to present an overview of the chemometric methods used in the nondestructive quality evaluation of fruits and vegetables and to provide a clear understanding of their advantages and disadvantages.

2 Major Chemometric Tools in Food Analysis

Chemometric approaches are used to optimize the experimental process, extract relevant chemical information from massive quantities of data, identify hidden relationships, and provide visual representations of the results. There are several types of chemometric approaches: design of experiment (DoE), preprocessing, explorative analysis, classification, regression, validation, feature selection, multiway analysis, etc. These methods are utilized for the nondestructive quality analysis of fruits and vegetables, as well as in other areas of food science and technology. The chosen method is determined by the challenge at hand, the type of experimental data, and the pros and cons of that particular chemometric approach (Martens & Martens, 2001).

2.1 Design of Experiment

The DoE technique ensures representativeness of the sample, allows for the evaluation of the primary sources of variability, and is the most effective way to optimize analytical measurement processes (Lawson, 2014). Experimental designs are frequently neglected or undervalued; however, in order to address the need for variable optimization as well as the development of adequate methods for carrying out the tests, a correct experimental design must be established in advance (Granato and de Araújo Calado, 2013; Leardi, 2009; Wold et al., 2004). DoE essentially determines how scientific research is carried out. The optimization protocol is especially important when developing new detection systems. A well-defined DoE not only allows scientists to investigate different factors and their interactions, but it also saves money (Granato and de Araújo Calado, 2013; Leardi, 2006, 2009; Wold et al., 2004). Figure 1 illustrates various experimental design methods used in chemometrics.

Fig. 1 Common experimental design methods in chemometrics: full factorial, Plackett–Burman, fractional factorial, central composite, Doehlert, and D-optimal designs

The choice of a specific design depends on the problem statement. For example, some methods are used for optimization while others are used for screening experiments. The advantages and disadvantages of some common experimental designs are presented in Table 1.

Table 1 Advantages and disadvantages of some common experimental designs
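As a simple illustration of a screening design, the sketch below generates a two-level full factorial design in Python; the factors, their coded levels, and the use of Python itself are illustrative assumptions rather than part of the chapter.

```python
# Minimal sketch: generating a two-level full factorial design for three
# hypothetical factors (temperature, time, concentration). Factor names and
# levels are illustrative, not taken from the chapter.
from itertools import product

factors = {
    "temperature":   [-1, +1],   # coded low/high levels
    "time":          [-1, +1],
    "concentration": [-1, +1],
}

# Full factorial: every combination of factor levels (2^3 = 8 runs).
design = list(product(*factors.values()))

for run, levels in enumerate(design, start=1):
    print(f"Run {run}: " + ", ".join(
        f"{name}={lvl:+d}" for name, lvl in zip(factors, levels)))
```

Fractional factorial or Plackett–Burman designs would keep only a chosen subset of these runs when the number of factors becomes large.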

2.2 Preprocessing of Data

After data collection, preprocessing is frequently the deciding factor between excellent and poor chemometric models (Rinnan, 2014). Preprocessing is used to reduce variation that is not connected to the property of interest, allowing the variation of interest to stand out more and be modeled more easily (Islam et al., 2018b). There are situations where enhancing spectral features is essential, for example, when the interesting spectral characteristics differ only slightly from the global intensity, when small peaks are difficult to see in the presence of a large one, or when peaks overlap. According to Roger et al. (2020), there are several types of systematic variation that are not related to the property of interest, for example, baseline shifts caused by light scattering from varying particle sizes, baseline offsets due to differences in instrumentation, and variations in signal intensity due to the size, shape, and volume of the sample. Figure 2 illustrates the data preprocessing methods available in chemometrics.

Fig. 2 Available preprocessing methods in chemometrics: scaling (Pareto, autoscaling), filtering (SAV-GOL smoothing, OSC), normalization (SNV, MSC), and transformations (absolute value, arithmetic operations)

Among all the preprocessing methods, mean centering, standard normal variate (SNV) normalization, baseline correction, orthogonal signal correction (OSC), Savitzky–Golay (SAV-GOL) smoothing and derivatives, and multiplicative signal correction (MSC) are the most common data preprocessing methods used in chemometrics (Vidal and Amigo 2012). The visualization of the data and the removal of severe bands that are driven by noise are the first steps in near-infrared (NIR) spectroscopic preprocessing. Then, to reduce any high-frequency noise, window-based smoothing techniques can be applied. The SAV-GOL algorithm is a widely used approach for reducing high-frequency noise. It involves fitting a polynomial of chosen order to a window of defined size that is moved across the entire spectrum (Rinnan et al., 2009). Ideally, the smoothed spectra would then contain only absorption features and be ready for regression or classification modeling. Because scattering effects are so prominent, the smoothing stage is frequently followed by scattering correction methods. Estimating the second derivative of the spectra is the most frequent approach since it can quickly remove first-order additive (baseline shift) effects and also reveal underlying peaks that would otherwise be invisible (Rinnan et al., 2009). Another widely used method is SNV, which involves subtracting each spectrum’s mean spectral intensity from each intensity response and then dividing by its spectral-domain standard deviation (Barnes et al., 1989). SNV can be used to eliminate additive and multiplicative effects. In NIR modeling, both the second derivative and SNV are quite useful and usually increase model prediction performance. Another prominent method is MSC, which assumes the spectrum has a multiplicative, additive, and residual component (Isaksson & Næs, 1988). To describe these effects more fully, the extended MSC (EMSC) model incorporates higher-order terms (Martens et al., 2003).
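To make the SNV and Savitzky–Golay steps concrete, the following minimal Python sketch applies both to simulated spectra; the data, window length, and polynomial order are illustrative choices, not recommendations from the chapter.

```python
# Minimal sketch of two common spectral preprocessing steps: SNV normalization
# and Savitzky-Golay smoothing with a second derivative. The spectra are
# simulated; window length and polynomial order are illustrative assumptions.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
spectra = rng.random((10, 200))          # 10 samples x 200 wavelengths

# SNV: subtract each spectrum's mean and divide by its standard deviation.
snv = (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(
    axis=1, keepdims=True)

# Savitzky-Golay: fit a local polynomial in a moving window; deriv=2 returns
# the second derivative, which removes baseline offsets and slopes.
d2 = savgol_filter(spectra, window_length=15, polyorder=2, deriv=2, axis=1)

print(snv.shape, d2.shape)
```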

In many cases, however, the scattering effect itself carries information that is important for describing the quality of fresh fruits and vegetables, and removing it in those cases may lead to a poorer chemometric model (Mishra et al., 2021). Robust normal variate (RNV) (Guo et al., 1999), probabilistic quotient normalization (PQN) (Dieterle et al., 2006), and variable sorting for normalization (VSN) (Rabatel et al., 2020) have all been proposed as improvements and alternatives to SNV. To summarize, there are numerous chemometric preprocessing approaches for removing or reducing scattering effects in spectral data. Table 2 summarizes the advantages and disadvantages of the common data preprocessing methods used in chemometrics.

Table 2 Some common preprocessing methods with their pros and cons

3 Principal Component Analysis (PCA)

Principal component analysis (PCA) is one of the most important and powerful methods in chemometrics (Bro & Smilde, 2014). PCA is a bilinear reduction approach that may condense enormous amounts of data into a few parameters known as principal components (PCs) or latent variables, which reflect the levels, differences, and similarities among the samples and variables that make up the modeled data. A linear transformation is used to accomplish this goal, with the constraints of conserving data variance and imposing orthogonality on the latent variables (Smilde et al., 2005).

3.1 PCA Data Analysis

PCA can be used to visualize the X data matrix in multivariate space, identify clusters, detect outliers, reduce the dimensionality of the data, and remove noise. The starting point for PCA is a matrix of data with N rows (observations) and M columns (variables), here denoted by X. Technically, PCA seeks the lines, planes, and hyperplanes in M-dimensional space that best approximate the data in the least-squares sense. A line or plane that is the least-squares approximation of a set of data points simultaneously maximizes the variance of the coordinates projected onto it (Wold et al., 1987).

The first PC is the line in M-dimensional space that best approximates the data in the least-squares sense. The line passes through the mean point, and each observation can be projected onto it to obtain a coordinate value along the PC line. This new coordinate value is referred to as a score. The second PC is a line in M-dimensional variable space that is orthogonal to the first PC. This line likewise passes through the mean point and improves the approximation of the X data as much as possible. If X is a data matrix with N rows and M columns, with each variable being a column and each sample a row, PCA decomposes X as the sum of $r$ outer products of score vectors $t_i$ and loading vectors $p_i$, where $r$ is the rank of the matrix X (Eq. 1).

$$X = t_1 p_1^T + t_2 p_2^T + \cdots + t_r p_r^T \tag{1}$$

$$X = t_1 p_1^T + t_2 p_2^T + \cdots + t_m p_m^T + E \tag{2}$$

The $t_i$, $p_i$ pairs are ordered by the amount of variance they capture. The score vectors $t_i$ contain information about how the samples relate to one another, while the loading vectors $p_i$ provide information on how the variables relate to one another. In general, the PCA model is truncated after m components, and the small-variance factors are consolidated into a residual matrix E (Eq. 2).

The basic premise is that the investigated systems are “indirectly observable,” meaning that the relevant phenomena that cause data variation/patterns are concealed and not directly measurable/observable. This is where the phrase “latent variables” comes from. Latent variables (PCs) can be expressed as scatter plots in the Euclidean plane once they have been discovered. A loading plot can be discussed in conjunction with the associated score plot, which is generated for the same pair of PCs, or it can be directly shown in the same figure, which is called a biplot. It becomes easier to explain the groups or patterns observed in the PC space in terms of the original variables in this way. Although the biplot format for spectral data is difficult to visualize, specific spectral regions that are responsible for the separation of process phases can be highlighted.
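A minimal PCA sketch in Python, assuming simulated data and the scikit-learn library, shows how scores, loadings, and explained variance are obtained in practice; these are the quantities displayed in the score plots, loading plots, and biplots described above.

```python
# Minimal PCA sketch with scikit-learn on simulated data: mean-centering,
# scores, loadings, and explained variance. The data are random and purely
# illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.random((30, 50))                 # 30 samples x 50 variables

pca = PCA(n_components=2)                # scikit-learn mean-centers X internally
scores = pca.fit_transform(X)            # T: sample coordinates on the PCs
loadings = pca.components_.T             # P: variable contributions to the PCs

print("scores", scores.shape)            # (30, 2)
print("loadings", loadings.shape)        # (50, 2)
print("explained variance ratio", pca.explained_variance_ratio_)
```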

3.2 Outlier Detection

Q residuals are the sum of squared residuals for each sample. In other words, Q is a measure of the distance of a sample from the PCA model. Therefore, a higher Q value means a lower model fit. Hotelling’s T2 is the sum of normalized squared scores (Hotelling, 1947). T2 is a measure of the variation in each sample within the PCA model (Fig. 3). Figure 4 presents Q residuals versus Hotelling’s T2 plot, which is very useful to determine the outlier sample.

Fig. 3 Graphical representation of the principal component space for a two-component model: samples with a large T2 show unusual variation inside the model, while samples with a large Q show unusual variation outside the model

Fig. 4 Q residuals versus Hotelling’s T2 plot

Samples in the region of extremes (bottom right) exhibit unusual behavior: they adhere to the variable correlation structure captured by the PCA model while achieving high scores in the score space. Because they pull the PC axes toward them, these samples with high Hotelling’s T2 values are said to have strong leverage. Samples in the region far from the model (top left), with high Q residual values, appear to be “well behaving” when projected onto the model space because they share some characteristics with the modeled category, but they are not well modeled because part of their variation is not accounted for by the model. The anomalous, extreme, and non-modeled samples with both high T2 and high Q values belong in the outlier region (top right) (Westerhuis et al., 2000).
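The sketch below computes the Q residuals and Hotelling’s T2 values for a fitted two-component PCA model; the simulated data and the choice of two components are assumptions made only for illustration.

```python
# Minimal sketch of the Q residual and Hotelling's T2 statistics for a fitted
# PCA model, computed with numpy; the data and the two-component model are
# arbitrary illustrative choices.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.random((40, 60))

pca = PCA(n_components=2).fit(X)
Xc = X - pca.mean_                        # center with the model mean
T = Xc @ pca.components_.T                # scores
Xhat = T @ pca.components_                # reconstruction from 2 PCs

# Q residual: squared distance of each sample from the model plane.
Q = np.sum((Xc - Xhat) ** 2, axis=1)

# Hotelling's T2: sum of squared scores normalized by the PC variances.
T2 = np.sum(T ** 2 / pca.explained_variance_, axis=1)

print(Q[:5], T2[:5])
```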

4 Partial Least Squares Regression

Partial least squares regression (PLSR) is a regression extension of PCA, which is used to connect the information in two blocks of variables, X and Y, to each other (Wold et al., 2001). PLSR is a method of relating two data matrices, X and Y, to each other by a linear multivariate model. PLSR stands for projections to latent structures by means of partial least squares. It derives its usefulness from its ability to analyze data with many noisy, collinear, and even incomplete variables in both X and Y. For parameters related to the observations (samples, compounds, objects, items), the precision of a PLSR model improves with the increasing number of relevant X-variables. This corresponds to the intuition of most chemists, technicians, and engineers that many variables provide more information about the observations than just a few variables do (Martens & Naes, 1991).

PLSR can be seen as a particular regression technique for modeling the association between X and Y, but it can also be seen as a philosophy of how to deal with complicated and approximate relationships (Geladi & Kowalski, 1986). Because PLSR considers not just the correlation between the two blocks of variables but also the amount of variation in each, the criterion for defining the PLS latent variables is formulated using covariance, which is a good metric of interrelation. The criterion is component-based because converting it to a global loss function is quite challenging. As a result, PLSR is a sequential algorithm: the PLS latent variables are computed in such a way that the first PLS component is the direction of maximum covariance between X and Y, the second PLS component is orthogonal to the first and has the highest residual covariance, and so on (Wold et al., 1983).

Outlier samples that are far from the center of the space spanned by the PLS model can be detected using plots of leverage or Hotelling’s T2. The critical limit for the Hotelling’s T2 statistic is based on an F-test (Hotelling, 1992), while the critical limit for leverage is based on ad hoc knowledge (Martens & Naes, 1991). A predicted versus measured plot should, in a good PLS model, display a straight-line relationship between predicted and measured values, ideally with a slope of one and a correlation close to one. A residual plot may be plotted against the value of the y-variable to check that the residuals do not depend on the value of Y. Outliers of various types, such as samples with significant residuals and influential samples, are commonly detected using the F residuals versus Hotelling’s T2 plot. Outliers are samples with high residual variance, that is, those that lie at the top of the plot. Influential samples are those that have high leverage, that is, those that lie to the right of the plot (Rousseeuw & Leroy, 1987). This indicates that they are pulling the model toward them so that it describes them better. Influential samples are not necessarily harmful if their variables follow the same pattern as the more “average” samples. A sample with both significant residual variance and high leverage is referred to as a “potential outlier”; in the presence of such outliers, the model focuses on the differences between these samples and the rest rather than on the more general traits common to all samples.
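A minimal PLSR sketch, assuming simulated spectra and the scikit-learn implementation, fits a model and checks the slope of the predicted versus measured relationship discussed above; the number of latent variables and the train/test split are illustrative.

```python
# Minimal PLSR sketch with scikit-learn: fit a model on simulated spectra and
# a reference value, then inspect the predicted-versus-measured relationship.
# The data, the number of latent variables, and the split are assumptions.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.random((100, 200))                       # simulated spectra
y = X[:, 50] * 3.0 + rng.normal(0, 0.05, 100)    # synthetic reference values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

pls = PLSRegression(n_components=5)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test).ravel()

# In a good model the fitted line is close to slope 1 and intercept 0.
slope, intercept = np.polyfit(y_test, y_pred, 1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```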

5 Classification

Datasets are frequently made up of samples from various groups or “classes.” Groups may differ for a variety of reasons, including variations in sample preparation, chemical constituent types such as aromatic, aliphatic, etc., or process conditions. A number of approaches for classifying samples based on measured responses have been developed, as shown in Fig. 5. Cluster analysis and unsupervised pattern recognition are methods for attempting to find groups or classes without the use of prior knowledge regarding class memberships. On the other hand, classification or supervised pattern recognition are terms used to describe methods that leverage known class memberships (Ballabio & Consonni, 2013).

Fig. 5 Overview of the classification techniques in chemometrics: supervised and unsupervised methods, each with linear and nonlinear approaches

Most cluster analysis approaches are based on the concept that samples that are close together in the measurement space are similar and therefore likely to belong to the same class. However, there are several ways to define the distance between samples. The most popular is the simple Euclidean distance. The Mahalanobis distance accounts for the fact that, in some datasets, the variance in certain directions is substantially greater than in others, so that distance in some directions is more relevant than distance in others (De Maesschalck et al., 2000).
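The following short sketch contrasts the Euclidean and Mahalanobis distances from a class centroid on simulated two-dimensional data; the class covariance and the test point are purely illustrative.

```python
# Minimal sketch contrasting Euclidean and Mahalanobis distances between a
# sample and the centroid of a simulated class; the data are illustrative.
import numpy as np

rng = np.random.default_rng(4)
# Class with much more variance along the first axis than the second.
cls = rng.multivariate_normal([0, 0], [[4.0, 0.0], [0.0, 0.25]], size=200)
x = np.array([1.0, 1.0])

center = cls.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(cls, rowvar=False))

euclidean = np.linalg.norm(x - center)
mahalanobis = np.sqrt((x - center) @ cov_inv @ (x - center))

# The Mahalanobis distance penalizes the low-variance direction more heavily.
print(f"Euclidean={euclidean:.2f}, Mahalanobis={mahalanobis:.2f}")
```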

Soft Independent Modeling of Class Analogy (SIMCA) makes use of the model features and incorporates information about the calibration data types. A SIMCA model is made up of a set of PCA models, one for each class in the dataset (Wold, 1976). The number of principal components in each class can vary; it is determined by the data in that class. Each PCA sub-model includes all of the standard components of a PCA model, such as the mean vector, scaling information, preprocessing such as smoothing and derivatives, and so on. The oldest and most studied supervised pattern recognition approach is linear discriminant analysis (LDA) (Fisher, 1936). It is a linear approach in the sense that the decision boundaries dividing the classes in their multidimensional variable space are linear surfaces (hyperplanes). The purpose of LDA is to identify the optimal linear surface in multidimensional space, which corresponds to the best separating straight line in two dimensions. Another common discriminant approach, partial least squares discriminant analysis (PLS-DA), is quite similar to LDA. Indeed, Barker and Rayens (2003) demonstrated that PLS-DA is simply the inverse least squares approach to LDA, producing essentially the same result but with the noise reduction and variable selection benefits of PLS. PLS is used in PLS-DA to create a model that predicts the class number for each sample (Næs et al., 2002). Table 3 summarizes the advantages and disadvantages of the most common chemometric methods used in nondestructive quality evaluation.

Table 3 Advantages and disadvantages of some common chemometric methods
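As a hedged illustration of PLS-DA, the sketch below regresses dummy-coded class membership on simulated spectra with scikit-learn’s PLSRegression and assigns each sample to the class with the largest predicted value; the two simulated classes and the number of latent variables are assumptions.

```python
# Minimal PLS-DA sketch: PLS regression on dummy-coded class membership, with
# class assignment by the largest predicted column. Data are simulated.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, (30, 50)),
               rng.normal(0.7, 1.0, (30, 50))])   # two classes of "spectra"
labels = np.array([0] * 30 + [1] * 30)

Y = np.eye(2)[labels]                    # dummy (one-hot) coding of the classes

plsda = PLSRegression(n_components=3).fit(X, Y)
predicted = plsda.predict(X).argmax(axis=1)

print("training accuracy:", (predicted == labels).mean())
```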

6 Model Validation

The most conservative validation method is to run the model on a sufficiently large representative independent test set. Several methodologies can be used to quantify sources of variation that are in principle unknown for future objects in order to make a model more robust to changes in the sample matrix, raw materials, chemical reagents, and so on (Westad & Marini, 2015). Though the goal is to have enough items to set aside a decent amount as a test set, this is not always practicable due to factors such as sample costs or reference testing. Cross-validation is the best alternative to using an independent test set for validation (Westad & Kermit, 2003).

Cross-validation (CV) is a practical and reliable way to test the significance of a PLS model. This procedure has become standard in chemometric analysis and is incorporated in one form or another in most commercial software. The basic idea of CV is to keep a portion of the data out of the model development, develop a number of parallel models from the reduced data, predict the omitted data with the different models, and finally compare the predicted values with the actual ones. The squared differences between predicted and observed values are summed to form the predictive residual sum of squares, which is a measure of the predictive power of the tested model (Stone, 1974). Various forms of cross-validation are available, for example, full cross-validation, segmented cross-validation, systematic segmented cross-validation, and validation across categorical information about the objects (Kos et al., 2003).
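A minimal sketch of segmented cross-validation, assuming simulated data and scikit-learn, shows how RMSECV can be computed for a range of latent variables and the minimum used to select the model size; the candidate range and the five-segment split are illustrative.

```python
# Minimal sketch of segmented (k-fold) cross-validation used to choose the
# number of PLS latent variables by minimizing RMSECV; data and the candidate
# component range are illustrative assumptions.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(6)
X = rng.random((80, 120))
y = X[:, 10] * 2.0 + X[:, 60] + rng.normal(0, 0.05, 80)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
rmsecv = []
for n_comp in range(1, 11):
    y_cv = cross_val_predict(PLSRegression(n_components=n_comp), X, y, cv=cv)
    rmsecv.append(np.sqrt(np.mean((y - y_cv.ravel()) ** 2)))

best = int(np.argmin(rmsecv)) + 1
print("RMSECV per component count:", np.round(rmsecv, 3))
print("selected number of latent variables:", best)
```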

7 Model Performances

The number of latent variables in a PLSR model is determined by minimizing the root mean square error of cross-validation (RMSECV). Given the data and the number of latent variables, overfitting is a possibility, but this purely data-driven strategy is usually the best option. The root mean square error of prediction (RMSEP) is a direct estimate of the prediction error of a PLSR model and can be calculated using Eq. (3). The accuracy and precision of the PLSR model are represented by the bias and the standard error of performance (SEP), respectively. The SEP and bias can be computed using Eqs. (4) and (5), respectively, where $\hat{y}_i$ and $y_i$ are the predicted and measured values of the ith observation in the test set and n is the size of the validation set (Amigo, 2021).

The accuracy, precision, and linearity of the models can be used to assess their performance. The root mean square error of calibration (RMSEC), RMSECV, RMSEP, and bias can all be used to express the model’s accuracy. The SEP can be used to examine the precision of the PLSR model, and R2 can be used to assess linearity using a linear fit of predicted versus measured values. Low RMSEC, RMSECV, RMSEP, and SEP values, as well as a high R2 value, indicate a good model (Islam et al., 2018a).

$$\mathrm{RMSEP} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \tag{3}$$

$$\mathrm{SEP} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(\hat{y}_i - y_i - \mathrm{bias}\right)^2} \tag{4}$$

$$\mathrm{bias} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) \tag{5}$$
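For concreteness, Eqs. (3), (4), and (5) can be computed with a few lines of Python; the measured and predicted values below are invented solely to exercise the functions.

```python
# Minimal numpy implementation of Eqs. (3)-(5): RMSEP, SEP, and bias computed
# from measured and predicted values of a hypothetical test set.
import numpy as np

def rmsep(y_meas, y_pred):
    return np.sqrt(np.mean((y_pred - y_meas) ** 2))

def bias(y_meas, y_pred):
    return np.mean(y_pred - y_meas)

def sep(y_meas, y_pred):
    e = y_pred - y_meas
    return np.sqrt(np.sum((e - e.mean()) ** 2) / (len(e) - 1))

y_meas = np.array([10.2, 11.5, 9.8, 12.1, 10.9])   # illustrative reference values
y_pred = np.array([10.0, 11.8, 9.9, 12.4, 10.6])   # illustrative predictions
print(rmsep(y_meas, y_pred), sep(y_meas, y_pred), bias(y_meas, y_pred))
```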

The receiver operating characteristic (ROC) curve demonstrates the trade-off between sensitivity (the true-positive rate, TPR) and specificity (1 − FPR, where FPR is the false-positive rate). Classifiers that produce curves closer to the top-left corner perform better. A random classifier is expected to give points along the diagonal as a baseline (FPR = TPR), and the test becomes less accurate as the curve approaches this 45-degree diagonal of the ROC space. The class distribution has no bearing on the ROC curve, which makes it ideal for evaluating classifiers that must detect infrequent events such as rotten products. Using accuracy, (TP + TN)/(TP + TN + FN + FP), to evaluate performance would, on the other hand, favor classifiers that always predict a negative outcome for uncommon events (Fig. 6).

Fig. 6 Receiver operating characteristic (ROC) curve: true-positive rate versus false-positive rate, with the diagonal line representing no predictive value
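A short sketch, assuming simulated labels and scores and scikit-learn’s roc_curve and auc functions, shows how the FPR/TPR pairs underlying a ROC curve are obtained.

```python
# Minimal ROC sketch with scikit-learn: true labels and classifier scores are
# simulated; roc_curve returns the FPR/TPR pairs plotted in a ROC curve and
# auc summarizes them in a single number.
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=200)
# Scores loosely correlated with the true class, plus noise.
y_score = y_true * 0.6 + rng.normal(0, 0.4, size=200)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("area under the ROC curve:", round(auc(fpr, tpr), 3))
```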

The confusion matrix and ROC curves can also be used to measure a classifier’s error rate. To understand the confusion matrix, consider a classification problem with two classes, where X is the negative class and Y is the positive class. There are four possible outcomes: a sample from class X is assigned to class X (true negative), a sample from class Y is assigned to class Y (true positive), a sample from class X is assigned to class Y (false positive), and a sample from class Y is assigned to class X (false negative).

To keep track of these various outcomes, a confusion matrix (or contingency table) is utilized. The confusion matrix’s columns correspond to the samples’ actual classes, while the rows correspond to the assigned classes. The main diagonal of the matrix shows the number of correctly categorized samples in each class, while the off-diagonal elements show the number of wrongly classified samples. The off-diagonal members of the matrix are zero if the data is perfectly classified. The accuracy, sensitivity (also known as precision, recall, hit rate, or true-positive rate), false-positive rate (also known as false alarm rate), and specificity of the classification can all be determined using the confusion matrix.

The confusion matrix for this two-class problem is:

                   True class X    True class Y
  Assigned to X         m               n
  Assigned to Y         o               p

Equation (6) gives the accuracy of the classification, where m is the number of samples from class X that are assigned to class X by the classifier, p is the number of samples from class Y that are assigned to class Y by the classifier, n is the number of samples from class Y assigned to class X by the classifier, and o is the number of samples from class X assigned to class Y by the classifier. The sensitivity of the classification is given by Eq. (7), and the false-positive rate is given by Eq. (8). The specificity of the classification is given by Eq. (9) (Islam et al., 2018b).

$$\mathrm{Accuracy} = \frac{m + p}{m + n + o + p} \tag{6}$$

$$\mathrm{Sensitivity} = \frac{p}{n + p} \tag{7}$$

$$\mathrm{False\text{-}positive\ rate} = \frac{o}{o + m} \tag{8}$$

$$\mathrm{Specificity} = \frac{m}{m + o} \tag{9}$$
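The counts m, n, o, and p translate directly into Eqs. (6), (7), (8), and (9); the sketch below implements them, with example counts chosen only for illustration.

```python
# Minimal implementation of Eqs. (6)-(9) using the confusion-matrix counts
# defined above (m, n, o, p); the example counts are illustrative.
def classification_metrics(m, n, o, p):
    accuracy = (m + p) / (m + n + o + p)
    sensitivity = p / (n + p)            # true-positive rate
    false_positive_rate = o / (o + m)
    specificity = m / (m + o)
    return accuracy, sensitivity, false_positive_rate, specificity

# e.g., 45 true negatives, 5 false negatives, 3 false positives, 47 true positives
print(classification_metrics(m=45, n=5, o=3, p=47))
```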

8 Variable Selection Methods

A model that uses the entire spectral range may be at risk of overfitting, resulting in decreased predictive performance. Furthermore, a spectrum contains a large quantity of data, much of which is redundant. Given this large redundancy in spectroscopic data, variable selection can typically improve chemometric models (Fig. 7).

Fig. 7 Common variable selection methods in chemometrics, grouped into classical, iterative, model-based, and nature-inspired approaches

The advantages of variable selection can be summarized in three aspects: (a) improved prediction accuracy, because uninformative variables, which have been shown theoretically to reduce precision, are eliminated; (b) improved interpretability, because the selected wavelengths are probably those responsible for the property of interest; and (c) improved computational efficiency, since modeling uses a smaller number of variables. The advantages and disadvantages of the most common variable selection methods are presented in Table 4.

Table 4 Advantages and disadvantages of available variable selection methods
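As one hedged example of a model-based selection strategy, the sketch below keeps the wavelengths with the largest absolute PLS regression coefficients and refits the model; the simulated data, the number of retained variables, and the choice of this particular criterion are assumptions, and the methods in Table 4 offer many alternatives.

```python
# Minimal sketch of a simple model-based variable selection: keep the
# wavelengths with the largest absolute PLS regression coefficients and refit.
# This is only one illustrative strategy; the data are simulated.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(8)
X = rng.random((80, 150))
y = 2.0 * X[:, 30] - 1.5 * X[:, 90] + rng.normal(0, 0.05, 80)

full_model = PLSRegression(n_components=5).fit(X, y)
coefs = np.abs(full_model.coef_).ravel()

keep = np.argsort(coefs)[-20:]           # indices of the 20 largest coefficients
reduced_model = PLSRegression(n_components=5).fit(X[:, keep], y)

print("selected variable indices:", np.sort(keep))
```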

9 Multiway Analysis

In some cases, data structures are more complex than typical. Most multivariate methods are designed to work with matrices, which can be thought of as data tables. If, on the other hand, the measurements for each sample are stored in a matrix, the structure of the data is then more effectively stored in a data “box.” Such data is referred to as multiway data. If each sample produces a matrix of size M × N and there are L samples, then an L × M × N three-way array is produced. There are several methods available to deal with these kinds of three-way data (Bro, 1998).
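The sketch below builds an L × M × N three-way array from simulated sample matrices and unfolds it into an ordinary matrix, which is what two-way methods such as PCA or PLS would require; genuinely trilinear structure is better handled by the dedicated three-way models described next. The dimensions are illustrative.

```python
# Minimal sketch of a three-way data array (L samples, each giving an M x N
# measurement matrix, e.g., excitation x emission) and its unfolding to an
# ordinary two-way matrix; the dimensions are illustrative assumptions.
import numpy as np

L, M, N = 12, 20, 30
rng = np.random.default_rng(9)
data_cube = rng.random((L, M, N))        # one M x N matrix per sample

# Unfolding (matricizing) the cube: each sample's matrix becomes one long row,
# giving an L x (M*N) matrix that PCA or PLS can handle. Truly trilinear data
# are better modeled directly with PARAFAC or Tucker methods.
unfolded = data_cube.reshape(L, M * N)

print(data_cube.shape, "->", unfolded.shape)
```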

The Generalized Rank Annihilation Method (GRAM) is a simple method with various applications. Many second-order analytical procedures, such as GC-MS, are bilinear, meaning that the data may be described as the outer product of concentration profiles and pure component spectra. The main issue with GRAM is that in many systems the concentration profiles change due to drift in the analytical apparatus (changes in the GC column of a GC-MS system, for example). These changes can rapidly degrade GRAM solutions, and GRAM’s early implementations produced nonsensical fictitious solutions.

The most remarkable difference between parallel factor analysis (PARAFAC) and PCA is that PARAFAC is unique up to scaling and permutation. Scaling ambiguity means that a column of A can be scaled by any value α as long as the corresponding column of B or C is scaled inversely, i.e., by 1/α. Permutation ambiguity means that the components can be reordered, for example component one relabeled as component two and vice versa. Aside from these minor ambiguities, the PARAFAC model is special in that it has only one solution. When compared to the non-uniqueness of a bilinear model, this uniqueness is the direct cause of much of PARAFAC’s popularity (Bro, 1997). If the measured data fit a PARAFAC model, the model’s underlying parameters can be calculated without rotational ambiguity.

The Tucker3 model, also known as the three-way PCA model, is among the most fundamental three-way models used in chemometrics (Tucker, 1966). The number three in the name Tucker3 refers to the fact that all three modes are reduced. If the Tucker model is applied to a four-way dataset and all modes are reduced, the model is called a Tucker4 model.

The so-called PARAFAC2 model, designed by Harshman (1972), is a more exotic yet extremely useful model; however, a workable algorithm was not developed until 1999 (Kiers et al., 1999). A dataset may be essentially trilinear yet not correspond to the PARAFAC model in some instances, for example because of sampling issues or physical artifacts. Another issue arises when the slabs of the array do not have the same row (or column) dimension. It turns out that the PARAFAC2 model can, in some circumstances, handle both shifted profiles and slabs of varying dimensions (Amigo et al., 2008). One of the essential features of the PARAFAC2 model is that, like PARAFAC, it is unique in certain situations, although the uniqueness conditions of PARAFAC2 have received far less attention than those of the PARAFAC model.

10 Tools for Chemometric Analysis

Several MATLAB toolboxes, R packages, and standalone software packages are available for performing chemometric analyses. Different tools offer different functionalities. Common chemometric tools and their functionalities are listed in Table 5.

Table 5 Available tools for Chemometrics

11 Conclusion

The growth in instrumentation is causing a data overload, and as a result, a large portion of the data is “wasted,” meaning that no usable information is extracted from it. The issue concerns data compression as well as information extraction. In general, laboratory and process measurements contain a lot of correlated or redundant data. These data must be condensed in a way that keeps the relevant information while making it easier to interpret than each variable individually. Furthermore, crucial information is frequently found not in any particular variable but in how the variables change in relation to each other, that is, in how they co-vary. In this scenario, the information must be extracted from the data. Moreover, in the presence of substantial noise, it is always preferable to apply some form of data processing. Therefore, a proper chemometric tool is essential for data cleaning, preprocessing, and extracting the most relevant chemical information from experimental data.