Introduction

Learning is the process of seeking knowledge [1]. We, as humans, can learn from our daily interactions and experiences because we have the ability to communicate, reason, and understand. With the rapid technological advancement in computer science, computational intelligence has led to the development of modern cognitive and evaluation tools [2, 3]. One such tool is machine learning (ML), which is often described as a set of methods that, when applied, allow machines to learn meaningful patterns from data repositories while maintaining minimal human interaction [4]. More specifically, a “computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” [5]. In other words, ML trains machines to understand real-world applications and to use this knowledge to carry out pre-identified tasks, with the goal of optimizing and improving the machines’ performance over time and with new knowledge. A closer look at the definition of ML shows that computers do not learn by reasoning but rather by algorithms.

From the perspective of this work, traditional statistical regression techniques are often used to carry out behavioral modeling, yet such techniques may suffer from large uncertainties, the need to idealize complex processes, approximation, and the averaging of widely varying prototype conditions. Furthermore, statistical analysis often assumes linear, or in some cases nonlinear, relationships between the output and the predictor variables, and these assumptions do not always hold true – especially in the context of engineering/real data. On the other hand, ML methods adaptively learn from experience and extract various discriminators. One of the major advantages of ML approaches over traditional statistical techniques is their ability to derive relationships between inputs and outputs without assuming prior forms or existing relationships. In other words, ML approaches are not confined to one particular space that requires the availability of a physical representation but rather go beyond that to explore hidden relations in data patterns [6,7,8,9,10,11].

While ML was initially developed within computer science, it is now an integral part of various fields, including energy/mechanical engineering [6,7,8,9], social sciences [10, 11], and space applications [12, 13], among others [14,15,16,17,18,19]. Owing to the availability of computationally powerful machines and ease of access to data (thanks in part to the rise of the Internet of Things and data-driven applications), the utilization of ML in civil engineering in general, and in materials science and engineering in particular, has been duly noted in recent years [20,21,22,23,24,25].

The widespread integration of ML into new research areas is due in part to the availability of user-friendly and easy-to-use software packages that simplify the ML process by providing pre-defined algorithms and training/validation procedures [26,27,28,29,30]. While such tools facilitate ML analysis and give researchers who are unfamiliar with ML fundamentals the means to easily carry out such analysis, they can still be misused by providing a false sense of confidence in the interpretation of results [31]. Another concern with utilizing ready-made approaches to carry out ML analysis lies in the need to compile proper observations (i.e. datapoints). In some classical fields (say materials science, earthquake, or fire engineering), where the number of observations is limited due to expensive tests or the need for specialized instrumentation/facilities [32], the use of ML may lead to biased outcomes – especially when combined with a lack of ML expertise [33, 34].

An examination of the open literature raises a few questions: 1) are we developing accurate ML models? 2) are such models useful to our fields? 3) are we properly validating ML models? and 4) how can we confidently answer “yes” to the aforementioned questions?

A distinction should be drawn here: we often apply existing ML algorithms to our problems rather than developing new algorithms. This is similar to applying other numerical tools, such as the finite element method, to investigate the response of materials and structures (say concrete beams) under harsh environments (i.e. fire conditions) [35, 36]. From this perspective, we use an existing tool, say a finite element (FE) software package (ANSYS [37], ABAQUS [38], etc.), to investigate how the failure mechanism develops in a concrete beam under fire. The accuracy of this FE model is often established through a validation procedure in which predictions from the FE model (say the temperature rise in steel rebars, the mid-span deflection during a fire, or, in some cases, the point in time when the beam fails) are plotted against those measured in an actual fire test. If the comparison is deemed adequate, then the FE model is said to be valid and hence can be used to explore the effect of key response parameters (i.e. magnitude of loading, strength of concrete, intensity of fire, etc.). From this perspective, the validity of an FE model is established if the variation between predicted results and measured observations falls within 5–15% [39].
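
As a minimal illustration of such a check (the deflection values below are hypothetical), the validation reduces to comparing the FE prediction against the measured response and verifying that the variation falls within the acceptance band:

```python
# Minimal sketch (hypothetical values): checking whether an FE prediction
# falls within the quoted variation band of the measured test response.
measured_deflection = 48.0    # mm, hypothetical mid-span deflection measured in a fire test
predicted_deflection = 52.5   # mm, hypothetical FE prediction

variation = abs(predicted_deflection - measured_deflection) / measured_deflection * 100
print(f"Variation: {variation:.1f}%")   # ~9.4%
print("Acceptable (<= 15%)" if variation <= 15 else "Outside the acceptance band")
```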

Unlike the use of FE simulation, ML is often used in two domains: 1) to show the applicability of ML to understanding a phenomenon [40, 41], and 2) to identify hidden patterns governing a phenomenon [33, 42]. In the first domain, ML is primarily used to show that an ML algorithm can replicate a phenomenon – or, in other words, to validate the applicability of that particular ML algorithm to a materials science problem (e.g. can deep learning be applied to predict the compressive strength of concrete given information on the components of a concrete mix?). While works in this domain showcase the versatility of ML, they also provide additional validation platforms/case studies for already well-established algorithms. The contribution of such works to our knowledge base is acknowledged and appreciated.

The second domain is where ML shines and can prove to be a powerful ally to researchers. This is because ML thrives on data and is designed to explore hidden features and patterns. The integration of these two items has not been thoroughly applied in our fields and, if applied properly, can not only open new opportunities but also revolutionize our perspective on our fields. Unfortunately, the open literature continues to lack works in this domain, and hence such works are to be encouraged.

Whether ML is used in the first or the second domain, ML models need to be rigorously assessed [43, 44]. This is critical to ensure: 1) the validity of the developed ML model in understanding a complex phenomenon given a limited set of data points, and 2) the proper extension of the same model toward new/future datasets. Traditionally, the adequacy of ML models is established through performance fitness and error metrics (PFEMs). Performance and error measures are vital elements in the process of evaluating ML models/frameworks. These are defined as logical and/or mathematical constructs intended to measure the closeness of actual observations to those expected (or predicted). In other words, PFEMs are used to establish an understanding of how predictions from a model compare to real (or measured) observations. Such metrics often express the variation between predicted and measured observations in terms of errors [45,46,47].

Diverse sets of performance metrics have been noted in the open literature, e.g. the correlation coefficient (R), the root mean squared error (RMSE), etc. In practice, one metric, or a combination of metrics, is used to examine the adequacy of a particular ML model. However, there does not seem to be a systematic view of the scenarios in which specific metrics are preferable. In order to bridge this knowledge gap, this work compiles the commonly used PFEMs and highlights their use in evaluating the performance of regression and classification ML models.
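
As a minimal illustration (the arrays below are hypothetical, e.g. concrete compressive strengths in MPa), several of the commonly used regression metrics can be computed directly from paired actual and predicted values:

```python
import numpy as np

# Hypothetical measured values (A) and model predictions (P).
A = np.array([30.0, 42.5, 55.0, 61.0, 38.0])
P = np.array([28.5, 45.0, 52.0, 63.5, 36.0])

R    = np.corrcoef(A, P)[0, 1]                                 # correlation coefficient
R2   = 1 - np.sum((A - P) ** 2) / np.sum((A - A.mean()) ** 2)  # coefficient of determination
RMSE = np.sqrt(np.mean((A - P) ** 2))                          # root mean squared error
MAE  = np.mean(np.abs(A - P))                                  # mean absolute error
MAPE = np.mean(np.abs((A - P) / A)) * 100                      # mean absolute percentage error

print(f"R={R:.3f}, R2={R2:.3f}, RMSE={RMSE:.2f}, MAE={MAE:.2f}, MAPE={MAPE:.1f}%")
```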

Performance Fitness and Error Metrics

This section presents the most widely used PFEMs and highlights the fundamentals, recommendations, and limitations associated with their use in assessing ML models. In this work, PFEMs are grouped under two categories: traditional and modern. In this section, the following recurring terms are used – A: actual measurements, P: predictions, and n: number of data points.

Regression

Regression ML methods deal with predicting a target value using independent variables. Some of these methods include artificial neural networks, genetic programming, etc. The PFEMs grouped herein are based on point-distance calculations that primarily use subtraction or division operations. These metrics contain fundamental operations, either A−P or P/A, and can be supplemented with absolute values or squaring. They are the most widely used metrics in the literature. The simplest form of common PFEMs results from subtracting a predicted value from its corresponding actual/observed value. This is straightforward, easy to interpret, and, most of all, yields the magnitude of the error (or difference) in the same units as the measured and predicted quantities; it can also indicate whether the model overestimates or underestimates observations (by analyzing the sign of the remainder). One should remember, however, that positive and negative errors can cancel each other out; in this scenario, a near-zero aggregate error could be calculated, falsely indicating high accuracy.
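
A minimal sketch (with hypothetical values) of this cancellation effect:

```python
import numpy as np

# Hypothetical observations and predictions: two overestimates and two
# underestimates of equal magnitude.
A = np.array([10.0, 20.0, 30.0, 40.0])   # actual observations
P = np.array([15.0, 15.0, 35.0, 35.0])   # predictions

errors = A - P                  # signed errors: [-5.,  5., -5.,  5.]
print(errors.mean())            # 0.0 -> falsely suggests a perfect model
print(np.abs(errors).mean())    # 5.0 -> the absolute error reveals the true discrepancy
```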

This can be avoided by using the absolute error (i.e. |A−P|), which only yields non-negative values. Analogous to the raw error, the absolute error maintains the same units as the predictions (and observations), and hence is easily relatable. However, by its nature, the direction of bias (over- or underestimation) cannot be determined from absolute errors.

Similar to the absolute error, the squared error also mitigates the mutual cancellation of errors. This metric is continuously differentiable and thus facilitates optimization. However, it emphasizes relatively large errors (as opposed to small errors), unlike the absolute error, and can be susceptible to outliers. Because the squared error is expressed in squared units (e.g. squared days), the resulting error values are not intuitive. Other metrics include the logarithmic quotient error (i.e. ln(P/A)) as well as the absolute logarithmic quotient error (i.e. |ln(P/A)|). Table 1 lists other commonly used metrics, together with some of their limitations and shortcomings as identified by surveyed studies.
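
The following sketch (hypothetical values, with a deliberately gross outlier in the last prediction) contrasts the squared, absolute, and logarithmic quotient errors:

```python
import numpy as np

# Hypothetical data; the last prediction is a deliberately gross outlier.
A = np.array([10.0, 12.0, 9.0, 11.0, 10.0])
P = np.array([11.0, 11.5, 9.5, 10.0, 25.0])

sq_err      = (A - P) ** 2        # squared error, in squared units
abs_err     = np.abs(A - P)       # absolute error, in the original units
log_q_err   = np.log(P / A)       # logarithmic quotient error, ln(P/A)
abs_log_err = np.abs(log_q_err)   # absolute logarithmic quotient error, |ln(P/A)|

print(sq_err.mean())      # ~45.5, dominated by the single outlier
print(abs_err.mean())     # 3.6, far less affected by the outlier
print(abs_log_err.mean())
```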

Table 1 List of commonly used PFEMs for ML regression models as collected from open literature

Most of the works conducted so far in engineering applications utilize only a few of the above PFEMs [20, 33, 61, 62, 72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92]. The bulk of the reviewed works continue to rely on traditional metrics such as R, R2, MAE, MAPE, and RMSE as the primary indicators of the adequacy of regression-based ML models. This seems to stem from our familiarity with these indicators, as opposed to others, such as Golbraikh and Tropsha’s criterion [58], the QSAR criteria of Roy and Roy [59], the criteria of Frank and Todeschini [60], and purpose-built objective functions often used in other fields and in the data sciences. It should be noted that, of the reviewed studies, the works of Gandomi et al. [90], Golafshani and Behnood [40], as well as Cheng et al. [62], applied a multi-criteria verification process that incorporated traditional as well as modern PFEMs. Utilizing multiple criteria is not only beneficial to ensure the validity of a particular ML model but is also recommended to overcome some of the limitations of traditional metrics identified in Table 1, and hence should be encouraged.
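
A minimal sketch of such a multi-criteria check follows (hypothetical data and purely illustrative acceptance thresholds). The a20-index is computed here as the fraction of samples whose prediction-to-observation ratio falls within [0.80, 1.20]; exact formulations in the literature may vary:

```python
import numpy as np

# Hypothetical observations and predictions.
A = np.array([30.0, 42.5, 55.0, 61.0, 38.0, 47.0])
P = np.array([28.5, 45.0, 52.0, 63.5, 30.0, 48.5])

R2   = 1 - np.sum((A - P) ** 2) / np.sum((A - A.mean()) ** 2)
RMSE = np.sqrt(np.mean((A - P) ** 2))
a20  = np.mean((P / A >= 0.80) & (P / A <= 1.20))   # fraction of predictions within +/-20%

# The model is only accepted here if all criteria are met simultaneously
# (the thresholds below are illustrative, not recommended values).
accept = (R2 > 0.8) and (a20 > 0.9)
print(f"R2={R2:.3f}, RMSE={RMSE:.2f}, a20={a20:.2f}, accept={accept}")
```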

Classification

In ML, classification refers to categorizing data into distinct classes. This is a supervised learning approach in which machines learn to classify observations into binary or multiple classes. Binary problems are those with two labels (e.g. positive vs. negative), while multi-class problems have more than two labels (e.g. types of concrete such as normal strength, high strength, and high performance). Classification algorithms include logistic regression, k-nearest neighbors, support vector machines, etc. [93, 94].

The performance of classifiers is often summarized in a confusion matrix. This matrix contains statistics about actual and predicted classifications and lays the foundation necessary to understand accuracy measurements for a specific classifier. Each column in this matrix signifies predicted instances, while each row represents actual instances. This matrix was identified as the “go-to” metric used in studies examining materials science and engineering problems [22, 95,96,97,98]. However, there are other PFEMs that can be used to evaluate classification models, and these are listed in Table 2. Similar to Table 1, Table 2 also lists some of the remarks and limitations pointed out by surveyed works. In this table, P denotes the number of real positives, N the number of real negatives, TP true positives, TN true negatives, FP false positives, and FN false negatives.
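
A minimal sketch (with hypothetical binary labels) of assembling the confusion matrix entries and deriving a few of the metrics listed in Table 2:

```python
import numpy as np

# Hypothetical binary labels: 1 = positive class, 0 = negative class.
actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
predicted = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

TP = np.sum((actual == 1) & (predicted == 1))   # true positives
TN = np.sum((actual == 0) & (predicted == 0))   # true negatives
FP = np.sum((actual == 0) & (predicted == 1))   # false positives
FN = np.sum((actual == 1) & (predicted == 0))   # false negatives

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)                      # also called sensitivity
f1        = 2 * precision * recall / (precision + recall)

print(f"TP={TP}, TN={TN}, FP={FP}, FN={FN}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```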

Table 2 List of the commonly-used PFEMs for ML classification models as collected from open literature

Closing Remarks

Our confidence in the accuracy of predictions obtained from ML algorithms heavily relies on the availability of actual observations and proper PFEMs. From this point of view, it is unfortunate that observations relating to the engineering disciplines continue to 1) be limited in size, and 2) lack completeness. The scarcity of such observations is often related to the limitations of conducting full-scale tests, the need for specialized equipment, and the wide variety of tested samples. For instance, one can think of how normal strength concrete mixes can vary significantly from one study to another simply due to variations in raw materials, mix proportions, and casting/curing procedures.

Combining the above two points with the notion of simply “applying ML” to understand a given phenomenon (say the flexural strength of beams) without thorough validation is bound to fail. In fact, in many instances, researchers have asserted the validity of a specific ML model by reporting its performance against traditional PFEMs, only for it to be later found that the model does not properly represent actual observations – despite having good fitness. This can be avoided by adopting a rigorous validation procedure [121, 122]. Unfortunately, many of the published studies on ML applications in engineering do not include multi-criteria or additional validation phases and simply rely on conventional performance metrics such as the R or R2 of the derived models. Furthermore, adopting a set of PFEMs does not negate the occurrence of common issues, most notably overfitting and bias. As such, an analysis that utilizes ML should also consider techniques such as the use of independent test datasets and varying degrees of cross-validation.
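
A minimal sketch of these two techniques, using synthetic data and a generic regressor (the model choice, feature count, and coefficients below are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data standing in for an engineering dataset (purely illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))                                   # hypothetical input features
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.5]) + rng.normal(0, 0.1, 200)

# 1) Hold out an independent test set that the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("R2 on unseen test set:", round(model.score(X_test, y_test), 3))

# 2) k-fold cross-validation yields a distribution of scores rather than a single value.
cv_scores = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5, scoring="r2")
print("5-fold CV R2 scores:", cv_scores.round(3))
```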

In order to ensure the fruitful use of ML, it is our duty to seek its proper application. In addition, one of the major concerns about ML-based models is their robustness under a wide range of conditions [123]. A robust ML model should not only yield reasonable PFEMs but should also be capable of capturing the underlying physical mechanisms that govern the investigated system [124]. An essential approach to verifying the robustness of ML models is to perform parametric and sensitivity analyses [123, 125]. These types of analyses ensure that the ML predictions are in sound agreement with the system’s real behavior and physical processes, rather than being merely the combination of variables that best fits the data. Another item to consider is the development of user-friendly, phenomenon-specific recommendation systems that guide novice users toward pre-identified PFEMs suited to evaluating a given problem (say, using R2 in a regression problem).
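
A minimal sketch of a one-at-a-time sensitivity analysis around a baseline input vector; the feature names, baseline values, and perturbation size are hypothetical, and `model` is assumed to be any fitted regressor exposing a predict() method (e.g. a scikit-learn estimator):

```python
import numpy as np

def sensitivity(model, baseline, feature_names, delta=0.10):
    """Perturb each input by +/- delta around a baseline vector and report
    the resulting change in the model prediction (one-at-a-time approach)."""
    base_pred = model.predict(baseline.reshape(1, -1))[0]
    for i, name in enumerate(feature_names):
        for sign in (+1, -1):
            x = baseline.copy()
            x[i] *= 1 + sign * delta
            pred = model.predict(x.reshape(1, -1))[0]
            print(f"{name} {sign * delta:+.0%}: prediction changes by {pred - base_pred:+.3f}")

# Hypothetical usage with a fitted regressor and mix-design features:
# sensitivity(model, baseline=np.array([350.0, 0.45, 28.0]),
#             feature_names=["cement content", "w/c ratio", "age"])
```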

The reader should remember that adding a single example to showcase recommended or important PFEMs would run counter to the purpose of this paper, which is to compile commonly used performance metrics and list their key characteristics in one document so that researchers interested in carrying out an ML analysis have a starting point for selecting proper performance metrics. Providing a comparison of all of the reviewed metrics would extend this work significantly beyond its scope and may not be feasible at the moment. We feel that this is best suited to a series of more in-depth reviews, to come in the near future, wherein metrics for classification and regression problems can be separately evaluated under well-designed problems and a variety of conditions to ensure fairness and unbiasedness.

It is not our intention to specifically identify a measure (or a set of measures), given the wide range of problems (as well as the varying quality of data) that a scientist could face. Please note that other researchers (quoted below) have also followed a similar approach:

  1. “Although some methods clearly perform better or worse than other methods on average, there is significant variability across the problems and metrics. Even the best models sometimes perform poorly, and models with poor average performance occasionally perform exceptionally well.” [126]

  2. “It is clearly difficult to convincingly differentiate ML algorithms (and feature reduction techniques) on the basis of their achievable accuracy, recall and precision.” [127]

  3. “Different performance metrics yield different tradeoffs that are appropriate in different settings. No one metric does it all, and the metric optimized to or used for model selection does matter.” [102]

Conclusions

Based on the information presented in this note, the following conclusions can be drawn.

  • ML is expected to become a key analysis tool in the coming few years, especially among materials scientists and structural engineers. As such, the integration of ML must be thorough and proper; hence the need for a proper validation procedure.

  • A variety of performance and error metrics exists for regression and classification problems. This work recommends the utilization of multi-fitness criteria (where a series of metrics is checked on one problem) to ensure the validity of ML models, as such criteria may overcome some of the limitations of individual metrics. The selected metrics should be of an independent nature to each other, such as R2, RMSE, and the a20-index.

  • The performance of existing metrics and of future fitness functions can be further improved through systematic collaboration between researchers from interdisciplinary backgrounds. For example, efforts are invited to identify and recommend metrics suitable for specific problems and datasets.

  • Future works should be directed towards documenting and exploring performance metrics for other types of learning, such as unsupervised learning and reinforcement learning. This is an ongoing research need that is to be addressed in the coming years.