Abstract
We provide a uniform, general, and complete formal account of evaluation metrics for ranking, classification, clustering, and other information access problems. We leverage concepts from measurement theory, such as scale types and permissible transformation functions, and we capture the nature of evaluation metrics in many tasks by two formal definitions, which lead to a distinction between two metric/task families and provide a comprehensive classification of the tasks that have been proposed so far. We derive some theorems to analyze the suitability (or otherwise) of some common metrics. Within our model we can derive and explain the theoretical properties and drawbacks of state-of-the-art metrics for multiple tasks. The main contributions of this paper are that, differently from previous studies, the formalization is well grounded in a solid discipline, it is general as it can take into account most effectiveness metrics as well as most existing tasks, and it allows us to derive important consequences on metrics and their limitations.
1 Introduction
The number of effectiveness evaluation metrics in information access tasks is very large, and growing: in Information Retrieval (IR) alone, more than 100 effectiveness metrics exist, not taking into account the user-oriented and Web-oriented ones (Amigó et al. 2014); in clustering, various accuracy measures are used, even in official experimental initiatives and sometimes with undesirable properties (Amigó et al. 2009); in filtering the situation is analogous (Amigó et al. 2011); and of course, when considering other tasks the number of metrics in use grows even more.
Given such a rich and complex evaluation scenario, it is not surprising that attempts have been made to understand, model, and formalize it. More specifically, researchers have defined formal properties (or axioms, or constraints) that must be satisfied by metrics. This has happened both in the early years (Van Rijsbergen 1974; Bollmann 1984) and more recently, with a renewed interest and several studies published in the last 5 years or so (Amigó et al. 2009, 2011, 2013, 2014, 2015; Moffat 2013; Busin and Mizzaro 2013; Maddalena and Mizzaro 2014; Ferrante et al. 2015; Sebastiani 2015; Ferrante et al. 2017). All these studies share a formal attitude: they try to understand the properties of effectiveness metrics in a rigorous way. This paper follows the path of these studies, and addresses the formalization of effectiveness evaluation.
As will be detailed in the following, this paper differs from previous studies in several respects. First, we aim at a more general approach: we do not focus on a specific task only, as others have done, but we take into account several information access tasks at the same time and we provide a uniform account. In this respect, one issue is that the number of tasks in information access is very large, and their variety is quite high: researchers build systems, for example, to classify tweets, cluster terms, retrieve documents, filter news, recommend movies, summarize texts, etc. To be able to provide a general and systematic account, we make two choices. First, we focus on the tasks that can be modeled by the assignment of a value to each item; most of the above mentioned examples match this description (the only exception being summarization). Second, we make an abstraction effort and we distill these many existing tasks into just four: Classification (assigning a category to each document), Clustering (organising documents into groups), Ranking (sorting documents), and Quantitation (assigning a numeric value to each document). To avoid confusion, we use a specific terminology: we use the term abstract task to refer to the latter four only (i.e., Classification, Clustering, Ranking, and Quantitation) and we use the term task to refer to all the former ones. Although this terminology is somewhat different from what is commonly used in the literature, we believe that the small effort is worthwhile, as it allows us to precisely define the four abstract tasks we are addressing.
A second difference from previous studies is that we ground our work in measurement theory, which provides several useful notions including the measurement scale. We are by no means the first to use measurement theory as a tool. However, we use it in a way that is different from previous work, and this is important to state upfront to avoid confusion. It is probably intuitive to directly link the notion of metric with measurement: a metric measures system accuracy/effectiveness. This is the approach of the seminal work by Van Rijsbergen (1974) and also of more recent proposals by Ferrante et al. (2017, 2019). However, we do not do so; in our approach measurement theory is used in a different way: our starting point is the fact that both system outputs and gold standards can be seen as assignments of values to documents, for all four abstract tasks. For example, the output of a classification system can be seen as an assignment of values on the nominal scale type; a ranking of documents (a typical search engine output) can be seen as an assignment of values to documents, to be interpreted on the ordinal scale type (i.e., considering only the rank induced by the assigned values); and so on. Thus, measurement theory allows us both to model system outputs and gold standards as assignments of values and to state a direct relationship between the abstract tasks and the scale types.
Then, coming to the third difference from previous studies, in our framework an evaluation metric compares two assignments of numbers to documents: one provided by a system, and another one provided by human assessments. We define the effectiveness/accuracy of the system as the closeness of the two assignments. In other terms, measurement theory allows us to distinguish the above four abstract tasks on the basis of the scale used when assigning the values. For example, the nominal scale is used for classification and clustering (in which the task can be modeled as assigning values whose only important properties are equalities and inequalities), the ordinal scale for ranking (in which the order of the values is the important property), and the interval and ratio scales for quantitation (in which differences and ratios are important, respectively).
But we need to add a notion of closeness, which measurement theory does not provide. Indeed, we will define two different kinds of closeness: this will allow us to take into account all four abstract tasks, and the related metrics, by just varying two parameters (measurement scale type and kind of closeness). The need for two different kinds of closeness will be detailed in the following, but can be immediately understood at an intuitive level by observing that for both classification and clustering the scale type is nominal, but the two are clearly different: in the former the values used in the assignments are important (otherwise a misclassification occurs), whereas in the latter only the equality and inequality relations are important, so any measurement which is equivalent for the nominal scale type (another notion provided by measurement theory) is adequate.
So, to summarize, we have three aims in this paper. First, to provide a general definition of metric valid for different abstract tasks such as classification, clustering, ranking, or quantitation (i.e., value prediction). This definition is based on measurement theory, but measurement theory is not enough and we also need to capture the concept of closeness, or proximity; we define two kinds of closeness. Second, to make explicit a correspondence between the four information access abstract tasks and a two dimensional space defined by: (i) the family of the metric (defined on the basis of two different approaches to closeness, based on values and equivalence), and (ii) the scale type to be used to analyse the correspondence between system and gold. Third, to state some theorems showing how our general definitions and axioms specialise into the metric properties defined in the literature for each particular information access task. Thus, the theoretical limitations of specific metrics that have been derived from the basic axioms in the literature can also be derived in our framework.
This paper is structured as follows. In Sect. 2 we extensively survey the previous attempts to formalize metric properties across various abstract tasks. We then turn to defining our framework, which is based on measurement theory. The reader can find in Appendix A a basic background in measurement theory, a well-established discipline that provides the foundations and tools for our formalization. We include basic definitions and some examples. We build on measurement theory to generalize the notion of evaluation metric in terms of closeness at specific scale types, as discussed in Sect. 3, where two kinds of closeness are defined. Section 4 focuses on effectiveness metrics: we distinguish two families of metrics; for each of the two we provide a formal definition and state some properties as axioms. In Sects. 5 to 7 we exploit the framework: we state some theorems that formally capture both general metric properties already proposed in the literature and properties of specific metrics, and that can be derived as particular cases from the definitions and axioms presented in our framework (proofs are in Appendix B). We finally show the generality of our framework by applying it to novel metrics and tasks in Sect. 8. Section 9 summarizes the main results, discusses consequences, assumptions and limits of this study, and sketches future work.
2 Related work
Several authors have proposed formal properties of metrics by defining axiomatics focused on a specific task. Terminology needs some clarification. Different authors have used “properties”, “constraints”, or “axioms”, and there is even some debate on which term is correct. In this paper we privilege the last term, although sometimes we also use one of the previous two as a synonym, especially when describing previous work (which we usually do using the same terminology as in the original). As already mentioned, we distinguish between the concepts of information access task and abstract task. The former is related to the user context and goals. Examples of tasks could be: searching web pages for generating a report about a topic, recommending products for online sales, spam filtering, sentiment analysis over tweets for reputation analysis, novelty detection in news, etc. All these tasks share the common characteristic that they consist of organising information items (web pages, mails, etc.). In this paper we refer to information items as documents. An abstract task can be seen as an attempt at formalising the tasks and/or their basic components, and it is related to the characteristics of system outputs and gold standards (i.e., human judgments/assessments, also called simply golds). We focus on the four abstract tasks listed above (Classification, Clustering, Ranking, and Quantitation), and we survey the main approaches, grouped by abstract task. We also note some details that are useful in the remainder of the paper. We call basic axioms those common to different authors and somehow related to the abstract task but independent from the task. Tables 1, 2, and 3 list the main properties and can be a useful reference; some of the notes in the tables are described in Sect. 6.
2.1 Classification axiomatics
Some authors group classification metrics according to their properties. For instance, Ferri et al. (2009) discriminated between probabilistic measures (which consider the deviation from the true probability of errors) and measures based on a qualitative understanding of errors (which focus on the idea of utility). The authors do not offer a formal distinction between probabilistic and qualitative measures.
More recently, Sebastiani (2015) proposed eight axioms. The first one is the Strict Monotonicity axiom: it states that, given two classification outputs that differ on only one decision, i.e., the category of a single document, then if one of them is correct on that document, this must be reflected in an increase in its metric score. This idea is also captured by Sokolova’s (2006) properties (see below). Sebastiani proved that the traditional F-measure (based on Precision and Recall) does not satisfy this property, as it fails when components of the contingency matrix have zero value. A similar problem is identified for the metric Lam% (Qi et al. 2010). Note, however, that zero values in components of the contingency matrix represent in general a very particular situation. We provide a slight generalization of Strict Monotonicity (MON) into a Generalized Strict Monotonicity Axiom (GMON) in the following sections.
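The zero-component failure can be illustrated with a minimal sketch (illustrative counts; treating zero-denominator Precision/Recall as 0 is one common convention, not Sebastiani's own formulation):

```python
def f1(tp, fp, fn):
    """F-measure from contingency counts, with the common convention of
    treating undefined Precision/Recall (zero denominators) as 0."""
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# System B differs from system A by a single decision (one false positive
# turned into a true negative), so Strict Monotonicity demands a strictly
# higher score for B; with zero true positives both scores are stuck at 0.
score_a = f1(tp=0, fp=2, fn=3)
score_b = f1(tp=0, fp=1, fn=3)
```

As soon as a true positive appears, the measure becomes sensitive again, which is why the problem is confined to these corner cases.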
Sebastiani’s second axiom, Continuous Differentiability, states that the evaluation measure must be continuous and differentiable over the true positives and true negatives. According to the author, measures fail to satisfy this axiom, again, in the case of zero values in the contingency matrix. Something similar happens with his third and fourth axioms, Strong Definiteness and Weak Definiteness, which state that the measures must be definable under any gold or system output. One might argue that the Strict Monotonicity axiom subsumes these two axioms, because the metric score of every system output must be definable in order to produce a score increase in the Strict Monotonicity conditions.
The fifth axiom (Fixed Range) sets a restriction on the measure value range. The sixth and seventh axioms (Robustness to Chance and Robustness to Imbalance) are related to the idea of probabilistic measures proposed by Ferri et al. (2009), and state that random or trivial classifiers must achieve the same score regardless of the gold standard. Measures such as Accuracy, Utility or F-measure (Precision and Recall), although widely adopted, do not satisfy this property. The reason is that there are situations and user contexts in which not every trivial or random classifier has the same effectiveness. For instance, randomly putting just a few mails in the spam directory is less problematic for the user than putting most of the mails there. The F-measure tackles this aspect by returning a fixed precision for any random output, while recall increases as more e-mails are returned to the user. The eighth and last axiom, Symmetry, “enforces the notion that the evaluation measure should be invariant with respect to switching the roles of the class and its complement”. Thus, replacing the positive samples by negative ones in both the system output and the gold produces the same classification score. However, again, this axiom is task-dependent and, therefore, not every metric is designed to satisfy it. For instance, class-oriented metrics, such as the F-measure combination of Precision and Recall, do not satisfy it. Utility metrics assign a utility weight to each class, so they do not satisfy it either. A system correctly labeling as spam 8 out of 10 spam messages would be useful to the user, even if not perfect, but a system correctly labeling as non-spam 8 out of 10 non-spam messages would probably be unacceptable. For this reason, non-symmetric metrics such as class-oriented metrics are employed in these tasks.
In an earlier paper, Sokolova (2006) proposed a formal categorisation according to a set of properties based on the invariance of measures under a change in the contingency matrix (TP, TN, FP, FN, i.e., True Positive, True Negative, False Positive, and False Negative, respectively).
These properties have a correspondence with axioms proposed by other authors. The invariance under the swap of TP with TN and FP with FN corresponds to Sebastiani’s Symmetry axiom. The second property is the invariance under a change in TN when all other matrix entries remain the same. This property characterises the class-oriented measures (Precision and Recall). If a measure is not sensitive to one of the components, then increasing the number of returned documents can always be beneficial, even when the document classification is random. This property is complementary to Sebastiani’s sixth and seventh axioms; thus, this property is also task-dependent. The next property is the invariance under a change in FP when all other matrix entries remain the same. Non-invariance here is necessary if the measure satisfies the Strict Monotonicity axiom. The fourth and last property is the invariance under classification scaling. According to the author, it is only satisfied by Precision, which is a partial measure that does not satisfy the Strict Monotonicity axiom.
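Invariance properties of this kind can be checked mechanically; a minimal sketch with illustrative counts:

```python
def precision(tp, tn, fp, fn):
    """Class-oriented measure: ignores true negatives entirely."""
    return tp / (tp + fp)

def accuracy(tp, tn, fp, fn):
    """Global measure: sensitive to every cell of the contingency matrix."""
    return (tp + tn) / (tp + tn + fp + fn)

# Sokolova's second property (invariance under a change in TN, all other
# cells fixed): Precision satisfies it, Accuracy does not.
```

Running both measures on two contingency matrices that differ only in TN (say, TN = 10 versus TN = 99) leaves Precision unchanged while Accuracy moves, which is exactly the invariance/non-invariance distinction described above.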
In summary, we see that some axioms and properties are related, equivalent, or subsumed, and others are task-dependent. The only exception we find is the Strict Monotonicity axiom, which is common across all authors and is generally satisfied by most metrics. According to this analysis, we consider Strict Monotonicity as the unique commonly accepted basic axiom for the classification abstract task.
2.2 Clustering axiomatics
Dom (2001) proposed five formal desirable properties for clustering metrics. These were later extended to seven by Rosenberg and Hirschberg (2007). Basically, this axiomatics consists of stating a bijective correspondence between clusters and classes in the gold. It assumes a set of useful clusters with high correspondence with classes (peer to peer) and a set of noisy (small) clusters. The axioms state that increasing the amount of noisy clusters, or splitting or joining useful clusters, decreases the score. These properties are implicitly subsumed by the Generalized Homogeneity/Completeness axiom that we introduce below, given that these movements require breaking correct relationships and increasing the amount of incorrect relationships.
Meila (2003) proposed an entropy-based metric (Variation of Information) and listed twelve desirable properties associated with it. Most of these properties are not directly related to the quality aspects captured by a metric, but rather to other intrinsic features such as the ability to scale or computational cost (Amigó et al. 2009). The exceptions are properties 4 and 7, related to Cluster size versus Quantity (task-dependent), and properties 10 and 11, related to Completeness.
Amigó et al. (2009) proposed an axiomatics consisting of four constraints that, by focusing on extreme situations in which one system output should outperform another, capture the essence of previously proposed axiomatics (Dom 2001; Meila 2003):
- Cluster Homogeneity: given a certain system output document distribution, splitting documents that do not belong to the same class must increase the output quality. This restriction was first proposed by Rosenberg and Hirschberg (2007). Although it seems a very basic constraint, measures based on editing distance do not satisfy it (Amigó et al. 2009).
- Cluster Completeness: the counterpart to the first constraint is that documents belonging to the same class should be in the same cluster. This intuition is also captured by Dom’s constraints. Measures based on set matching, such as Purity and Inverse Purity, do not satisfy this constraint.
- Rag Bag: introducing disorder into a disordered cluster (rag bag) is less harmful than into a clean cluster. In general, all traditional measures fail to comply with this constraint. However, it can be considered task-dependent: for example, in an early alarm detection task, considering a few related messages in isolation could be crucial.
- Cluster size versus quantity: a small error in a big cluster is preferable to a large number of small errors in small clusters. This constraint prevents the problem that measures based on counting pairs (Amigó et al. 2009; Meila 2003; Halkidi et al. 2001) overweight big clusters (these measures are sensitive to the combinatorial explosion of pairs in big clusters, and fail on this constraint). Although this principle is shared by several authors (Amigó et al. 2009; Dom 2001; Meila 2003), we can consider a task in which this axiom is not mandatory, as one could be interested in penalizing errors in large clusters more than multiple errors in small clusters.
The Cluster Homogeneity and Cluster Completeness constraints can be generalized into a single one: given two identical clustering outputs, with the exception that the second output contains correct clustering relationships that do not appear in the first output, or does not contain incorrect clustering relationships that appear in the first output, then the metric must strictly increase. In the following sections we formalise this as Generalized Homogeneity/Completeness (GHC).
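A minimal sketch (illustrative items and classes) shows how a set-matching measure like Purity violates the completeness side of this generalized constraint:

```python
from collections import Counter

def purity(clusters, gold):
    """Purity: each cluster is credited with its majority gold class."""
    total = sum(len(cluster) for cluster in clusters)
    hits = sum(Counter(gold[item] for item in cluster).most_common(1)[0][1]
               for cluster in clusters)
    return hits / total

gold = {"a": "X", "b": "X", "c": "X"}   # all items share one class
singletons = [["a"], ["b"], ["c"]]
merged = [["a", "b"], ["c"]]
# `merged` adds a correct clustering relationship (a with b) that is
# missing from `singletons`, so GHC requires a strictly higher score;
# purity is blind to completeness and scores both outputs at 1.0.
```

This is the same phenomenon noted above for set-matching measures: they reward homogeneity but cannot distinguish a fully split clustering from a correctly merged one.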
In summary, we can conclude that the basic axioms shared by different analyses are Completeness and Homogeneity, which can be generalized into a unique axiom.
2.3 Ranking axiomatics
Most of the work on ranking axiomatics has been developed in the context of IR (as discussed in the first part of this section). There has been some discussion about ordinal classification too (as discussed in the last part). It is however important to understand that, because they focus on those concrete tasks, researchers have proposed, if not taken for granted, some properties that do not apply to the abstract task of ranking. For example, it might sound surprising, but top-heaviness is not mandatory for the abstract task of ranking. It is possible to imagine concrete ranking tasks that do not reward correctness in earlier rank positions more than in later ones: to provide a simple example, if one has to evaluate an approximate algorithm that sorts an array alphabetically, Kendall’s Tau correlation would be a reasonable measure, although it is not top-heavy; indeed, it would not make any sense to reward the sorting algorithms that are more correct for “A” than for “Z”.
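The sorting example can be made concrete with a minimal sketch (a simplified Kendall's Tau over lists of distinct items, assuming no ties): an adjacent swap costs exactly the same whether it happens at the top or at the bottom of the ranking.

```python
from itertools import combinations

def kendall_tau(ranking, gold_order):
    """Kendall's Tau between a system ranking and a gold ordering,
    both given as lists of the same distinct items, best first."""
    position = {item: i for i, item in enumerate(gold_order)}
    pairs = list(combinations(ranking, 2))
    concordant = sum(1 for a, b in pairs if position[a] < position[b])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)

gold = list("ABCDEF")
swap_top = list("BACDEF")     # adjacent swap at the top of the ranking
swap_bottom = list("ABCDFE")  # adjacent swap at the bottom
# Tau penalises both errors identically: it is not top-heavy, which is
# exactly what one wants when evaluating a sorting algorithm.
```

A top-heavy IR metric would instead score `swap_bottom` above `swap_top`, which illustrates why top-heaviness is a task-dependent property rather than a basic axiom of ranking.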
In one of the early works on formalizing IR evaluation, Van Rijsbergen (1974) already suggested to use measurement theory (as we do extensively in this paper). However, that seminal work does not state formal properties for simple evaluation measures, but for the combination of them (e.g., the well known F-measure), using the Conjoint Measurement Theory. Other authors tried to exploit measurement theory and to define a notion of similarity between measurements when formalizing IR (Busin and Mizzaro 2013; Maddalena and Mizzaro 2014; Ferrante et al. 2015). This paper extends that idea, but using measurement theory to state definitions and axioms for evaluation measures equally valid for different information access abstract tasks, not just IR.
Moffat (2013) listed seven properties that IR metrics should satisfy. The first one is Boundedness. It is not about quality; it requires the existence of a bounded range of scores. The second one is called Monotonicity and states that if a ranking of length k is extended so that \(k+1\) elements are included, the metric value never decreases. This second property is task-dependent: in some situations, reducing the size of the returned document list could be useful, for instance if it contains only irrelevant documents. In addition, according to the author, this property contradicts the next one, the Convergence property, which reflects the basic principle that relevant documents must occur above irrelevant documents in the system output ranking. The property states that swapping two documents in the ranking in concordance with the relevance judgements strictly increases the metric score. The fourth property, Top-weightedness, makes explicit that in IR the first positions in the ranking are the most important ones. As anticipated at the beginning of this section, this constraint, although widely accepted, is task-dependent, because it is related to the cost of exploring the ranking produced by the system, i.e., the probability that the user actually explores each ranking position. Usually it is reasonable to assume that this probability decreases going down the ranking, but one might imagine situations where this is not true and a very persistent user explores all the ranking positions. The fifth one is referred to as Localisation: a metric value at a given rank position k should depend only on the documents in the first k positions. This axiom can be applied only to metrics that assume a certain depth threshold as input parameter. That is, it is not a general axiom. The sixth property, Completeness, states that the metric must be definable even when there are no relevant documents in the collection.
Finally, Realisability states that the maximal score can be achieved even when there exists only one relevant document in the collection. This property is related to the normalisation of scores across test cases.
Ferrante et al. (2015) proposed two axioms: Replacement (replacing an irrelevant document in the output ranking with a relevant one increases the score) and Swapping (swapping two documents in the output ranking in concordance with the relative relevance in the gold annotation increases the score). In fact, if we assume that documents outside the output ranking are located together in an additional last ranking position, then Replacement is subsumed by Swapping. In addition, Swapping is equivalent to Convergence.
Amigó et al. (2013) proposed five axioms. The first one (Priority (PRI)) states again that swapping documents in a correct way increases the score. It is similar to Ferrante’s Swapping axiom, although somewhat more relaxed, since it requires a metric score increase only when the ranking positions of the swapped documents are contiguous. The next three axioms are related to assumptions about the user task: deeper positions in the ranking are less likely to be explored (Deepness axiom); there is an area at the top of the ranking which is always explored by the user (Closeness Threshold axiom); and there is an area deep enough that it is never explored by the user (Deepness Threshold axiom). The latter two axioms are formalised by comparing one relevant document in the first position, n relevant documents in the first 2n ranking positions, and a huge amount of relevant documents after a huge amount of irrelevant documents. The fifth axiom states that, given a ranking containing only irrelevant items, the shorter the ranking the higher the metric score (Confidence axiom). This axiom is also task-dependent: one could argue that reducing the ranking length, rather than saving user effort, adds uncertainty about the possibility of finding relevant documents at lower ranks.
In summary, according to our analysis, most of the axioms and properties are task-dependent. The basic axioms, common to all studies, are related with the correctness of priority relationships: the Swapping constraint proposed by both Ferrante et al. (2015) and Moffat (2013), and its relaxed version, the Priority axiom proposed by Amigó et al. (2013).
We also briefly mention the task of assigning items to categories which have an order, named Ordinal Classification: it is a sort of mixture between classification and ranking. It is quite a popular task in some situations: besides the common assignment of “stars” on several Web review sites, consider the polarity detection task, in which text fragments must be classified in terms of sentiment analysis according to a few categories such as “Very positive”, “Positive”, “Neutral”, “Negative” and “Very negative”. The task is defined in an ordinal manner, but the perfect system should return exactly the same values as the gold, so it is not a ranking problem like traditional IR. Some evaluation campaigns have applied a classification-oriented metric, as in Barbieri et al. (2016). In other campaigns, systems were evaluated with ranking metrics, as in Amigó et al. (2013); in others, different tasks were defined in concordance with different metrics (ranking or classification), as in Rosenthal et al. (2017).
We can see a similar situation in semantic textual similarity (Agirre et al. 2015), in which text pairs must be categorised according to a few classes (high / average similarity, etc.): the organizers used Pearson correlation. The prediction of stars in product recommendation also fits into this case: sorting products in a correct way is desirable, as well as assigning the correct amount of stars. In fact, recommender systems were initially evaluated in terms of accuracy, before the community began to work with ranking-based metrics. A few studies in the literature have addressed this task: Gaudette and Japkowicz (2009) analysed the behavior of some traditional metrics, such as Mean Average Error, Mean Squared Error, linear correlation, and Accuracy within n; Baccianella et al. (2009) proposed a method to make Ordinal Classification metrics robust to imbalance; and Cardoso and Sousa (2011) analysed the suitability of traditional metrics by means of a particular case and proposed the Ordinal Classification Index metric. However, differently from the previous abstract tasks, these studies do not define properties to be satisfied, and in general there is no clear choice of the metric to be used when predicting a few labels that keep an ordinal relationship.
2.4 Quantitation axiomatics
Quantitation, i.e., the abstract task of assigning numeric values to documents, is perhaps less common, but it is not purely theoretical, and some examples can be found. In the Semantic Textual Similarity task at SemEval-2016 (Agirre et al. 2016), systems are asked to return a numeric value predicting the similarity between two snippets of text. In some sentiment polarity detection tasks, the absolute polarity values returned by systems are compared with the reference values. In both cases the stated goal consists of maximising the linear correlation between system outputs and golds. The most frequent metric in these cases is the Pearson coefficient, although this choice has been criticised (Reimers et al. 2016) and might change in the future.
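The interval-scale character of this evaluation can be sketched in a few lines (illustrative values): the Pearson coefficient is invariant under any positive linear transformation of the system outputs, so it rewards correlation with the gold rather than exact value prediction.

```python
def pearson(xs, ys):
    """Sample Pearson correlation between two value assignments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

gold_vals = [1, 2, 3, 4]
sys_vals = [1.1, 1.9, 3.2, 3.8]
# Rescaling and shifting the system outputs (a permissible transformation
# at the interval scale type) leaves the Pearson score unchanged.
rescaled = [2 * s + 1 for s in sys_vals]
```

A system whose outputs are systematically shifted or rescaled therefore scores as well as one that predicts the exact values, which is appropriate only when the task is interpreted at the interval scale type.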
In other cases the goal consists of predicting the exact values. One example is the proposal to evaluate information retrieval systems not only on the basis of the rank of the retrieved documents, but on the basis of the numeric relevance values assigned to documents. With this approach, and assuming a continuous notion of relevance, Della Mea and Mizzaro (2004) proposed ADM (Average Distance Measure). More recently, magnitude estimation has been proposed as a technique to gather relevance assessments on a ratio scale (Maddalena et al. 2017). One might even claim that there is a trend to go beyond the classical situation (category relevance and ranking retrieval), although we are not aware of any attempt to capture the properties of metrics for quantitation. Therefore we include this abstract task in our analysis.
3 Measurement theory and closeness
Measurement theory is briefly recalled in Sects. 3.1 and 3.2 (for more details see Appendix A), where we also discuss the notion of closeness. We then outline the structure of our framework in Sect. 3.3. Sections 3.4–3.6 provide some definitions, and Sect. 3.7 provides an example (and the reader might find it useful to go back to Sects. 3.4–3.6 after having read it).
3.1 Measurement theory
Appendix A recalls some of the basic concepts of measurement theory, that we assume as known in the following: assignment, measurement, scale types and permissible transformation functions, equivalence, and meaningfulness (see Definitions A.1–A.4 and A.6). The reader familiar with measurement theory can probably just skim the appendix to get acquainted with our notation, or maybe even skip it and refer to specific parts when needed.
Briefly, measurement theory studies the properties of value assignments to objects like, for instance, temperature, height, and distance. At the core of measurement theory there is the notion of scale type. The classical scale types are nominal, ordinal, interval, and ratio. At each scale type, some relationships between values make sense and others are not meaningful. For instance, at the nominal scale type only equality and inequality of values can be taken into account, whereas considering the ratio or the interval between values makes no sense (e.g., “red” divided by “green”). For each scale type, a set of permissible transformation functions is defined. These, when applied to a measurement, determine which are the measurements that, though different, are equivalent, i.e., carry the same meaning. For example, for the ordinal scale type, value assignments that order objects in the same way are equivalent, and the set of permissible transformation functions are the strictly monotonically increasing functions.
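The ordinal example can be made concrete with a small sketch (item names and values are illustrative): applying a strictly monotonically increasing function to a value assignment yields an equivalent measurement, since the induced ranking is unchanged.

```python
def induced_ranking(assignment):
    """Items ordered by assigned value, best first (ties broken by name)."""
    return sorted(assignment, key=lambda item: (-assignment[item], item))

scores = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
# A strictly monotonically increasing function is a permissible
# transformation at the ordinal scale type...
transformed = {item: 10 * v + 3 for item, v in scores.items()}
# ...so the two assignments are equivalent: they induce the same ranking.
```

The same check fails at the interval or ratio scale types, where only linear (respectively, similarity) transformations are permissible.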
This matches our abstract tasks: in general, we can say that systems assign values to items (relevance, topics, categories, priority, etc.), and evaluation consists in comparing the system output assignment against the human annotated assignment (the gold). As an example, suppose that we want to categorize some documents into “physics”, “biology”, and “social sciences”. This is a classification problem, and there are no ordinal or interval relationships between classes. The goal consists in maximizing the number of equalities between system output values and gold values, with no consideration of the range or interval of errors: the predicted categories are either accurate or not. In other terms, classification problems can be mapped onto the nominal scale type. Similarly, ranking problems can be mapped onto the ordinal scale type, and quantitation problems onto the interval and ratio scale types. However, the situation is slightly more complex; this can be seen by considering that clustering can be mapped onto the nominal scale type as well, but it is an intrinsically different problem from classification. Anticipating what we will discuss in detail in the following, we model this difference on the basis of the kind of closeness that can be defined between two measurements: for classification we seek equal measurements; for clustering we seek equivalent measurements. We will also discuss that it is not even necessary to assume that system outputs and golds are measurements; assignments are enough.
Incidentally, we remark that by exploiting the basic concepts of measurement theory, like scale types and meaningful statements, it would be possible to directly derive some consequences on effectiveness metrics. For example, the well known Mean Reciprocal Rank (MRR) metric is computed by considering the rank of the first relevant document, computing its reciprocal, and averaging these values. But by doing so, one is neither applying permissible transformation functions, nor deriving meaningful statements, for the scale type at hand, which is the ordinal one; a similar remark has been made by Fuhr (2018). Also the widely adopted Normalized Discounted Cumulative Gain (nDCG) metric transforms an ordinal relevance scale (e.g., Highly relevant, Relevant, Marginally relevant, Not relevant) into numerical gains which are on a ratio scale type, and this is intrinsically arbitrary.
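To make the remark about nDCG concrete, the following Python sketch (our own illustration, with made-up relevance grades and rankings) shows that two gain mappings, both monotone in the ordinal grades and hence equally defensible, can disagree on which of two rankings is better:

```python
import math

def dcg(grades, gain):
    """Discounted cumulative gain with a configurable gain mapping."""
    return sum(gain(g) / math.log2(i + 2) for i, g in enumerate(grades))

linear = lambda g: g                # gains 0, 1, 2, 3
exponential = lambda g: 2 ** g - 1  # gains 0, 1, 3, 7

run_a = [2, 3]  # relevance grades of the retrieved documents, best rank first
run_b = [3, 1]

# Both mappings respect the ordinal grade order, yet they disagree on
# which run is better:
print(dcg(run_a, linear) > dcg(run_b, linear))            # True
print(dcg(run_a, exponential) > dcg(run_b, exponential))  # False
```

Since the ideal DCG is the same for both runs, the disagreement survives normalization, i.e., it holds for nDCG as well.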
But we believe that the consequences of measurement theory on evaluation are deeper; in the rest of the paper we discuss those more foundational aspects.
3.2 Closeness
Measurement theory focusses on equivalence rather than closeness. For instance, according to Suppes and Zinnes (1963) the first two fundamental problems in measurement theory are the Representation Theorem and Uniqueness. Both are related to finding homomorphisms between the empirical and numerical structures, thus studying which assignments are indeed measurements. The third problem is the Meaningfulness Problem (see Sect. A.4). Nothing addresses the similarity, or distance, or closeness of two assignments or measurements. However, the main goal of evaluation is to compare system outputs against gold standards or references, which are almost never equivalent. In fact, equivalence would mean perfect effectiveness, which is an extremely rare event in information access. So, a simple binary comparison (equivalent versus non-equivalent) is not enough and, in this sense, the concept of closeness between two assignments (and, consequently, measurements) needs to be added to the concepts of measurement theory to formalise the evaluation scenario. As we will see in Sect. 4, this will allow us to define the notion of evaluation metrics. Going back to the classification problem of the example in Sect. 3.1, this means that we need to model the closeness between system output values and gold values at the nominal scale type.
As another example, suppose that we are interested in grouping documents correctly. That is, the system must assign to documents about physics the same value, but different from biology documents. In this case, the document tags generated by the system (assigned values) are not necessarily equal to the category identifiers provided by humans in the gold. For this, we can exploit the notion of equivalence in measurement theory. That is, two assignments are equivalent at the nominal scale type if they keep the same equality relationships between objects. For instance, the assignments (\(A=1\), \(B=1\), \(C=2\)) and (\(A=3\), \(B=3\), \(C=1\)) are equivalent at the nominal scale type, given that, in both cases, \(A=B\ne C\) is true. This matches our clustering problem. In terms of measurement theory we can say that the goal of clustering evaluation is to quantify how close the system output is to being equivalent to the gold at the nominal scale type.
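The nominal equivalence just described can be sketched in code; the assignments below are those of the example, while the function itself is our own illustration:

```python
def nominally_equivalent(w1, w2):
    """True iff the two assignments induce the same partition of the objects,
    i.e., some bijective relabeling maps one onto the other."""
    mapping, inverse = {}, {}
    for obj in w1:
        a, b = w1[obj], w2[obj]
        # The relabeling must be a function (a -> b) and injective (b -> a).
        if mapping.setdefault(a, b) != b or inverse.setdefault(b, a) != a:
            return False
    return True

# The two equivalent assignments from the text:
print(nominally_equivalent({'A': 1, 'B': 1, 'C': 2},
                           {'A': 3, 'B': 3, 'C': 1}))  # True
# A non-equivalent pair (different partitions of {A, B, C}):
print(nominally_equivalent({'A': 1, 'B': 1, 'C': 2},
                           {'A': 3, 'B': 2, 'C': 1}))  # False
```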
Let us finally consider the abstract task of sorting documents according to their relevance, the classical document ranking problem. In this case, the evaluation must consider the ordinal relationships between assigned values, while the interval or ratio between assigned values is not important. That is, documents must be sorted in the same way as in the gold, i.e., relevant documents earlier than irrelevant documents. In terms of measurement theory, the goal of ranking is generating a relevance assignment equivalent to the gold. We can interpret document ranking as an ordinal equivalence oriented problem.
We state two general definitions of evaluation metric, value-oriented and equivalence-oriented, and we prove that, depending on the scale type, existing metrics and their desirable properties defined in the literature for each abstract task fit into the corresponding definition. But then, more situations appear spontaneously in the proposed model. For instance, in polarity detection, the (positive, neutral, negative) categories present an ordinal relationship, while the intervals between polarity levels are undefined. However, unlike in ranking problems, predicting the specific polarity category of documents must be rewarded. This is an Ordinal Classification problem (see Sect. 2.3), but there is no consensus on how this problem should be formalized. Our framework identifies it as a value-oriented ordinal problem, fitting into our definition.
The proposed model goes further, and includes abstract tasks at the interval and ratio scales. The model also allows us to analyze evaluation metrics, and to understand when they can be used appropriately. For example, we prove in this paper that the Pearson coefficient fits into our equivalence-oriented metric definition at the interval scale type, the popular cosine distance fits into the equivalence-oriented metric definition at the ratio scale type, and the commonly used error rate or mean absolute error fit into the value-oriented metric definition at the ratio or interval scale types.
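As a preview of the claim about the Pearson coefficient, the following sketch (with made-up values) checks its invariance under positive linear transformations, which are the permissible transformation functions of the interval scale type:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

gold = [10.0, 20.0, 30.0, 25.0]  # made-up gold values
sysout = [1.2, 2.1, 2.9, 2.4]    # made-up system output

r = pearson(sysout, gold)
# A positive linear transformation of the system output alone leaves
# the score unchanged:
r2 = pearson([5 * v - 3 for v in sysout], gold)
print(math.isclose(r, r2))  # True
```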
3.3 Structure of the framework
By exploiting some basic results of measurement theory, as well as the notion of closeness, we now turn to defining our framework. Figure 1 sketches the overall structure of the framework and can be a useful reference in the following. We ground our framework on some definitions from measurement theory (orange boxes with dotted borders); we have briefly recalled them above, but the reader is referred to the corresponding Definitions A.1–A.6 in Appendix A. Then, we analyse the notion of proximity, or closeness (dashed green boxes). We start from a simple but general definition of closeness (Definition 1) and we particularize it for the four classical scale types (Definition 2). On this basis, we define two kinds of closeness measures for assignments, depending on whether they focus on value matching (Definition 3) or on measurement equivalence (Definition 4). These definitions can be particularized for any scale type: nominal, ordinal, interval, or ratio. Then, moving to the yellow boxes, we define system outputs and golds as assignments of numeric values to items (Definition 5); let us remark that this is done for compatibility with measurement theory and without loss of generality as long as we consider the abstract tasks. Finally, the definition of metric in general is directly derived from the notion of closeness (Definition 6), as well as the two specific definitions of value- and equivalence-oriented metrics (Definitions 7 and 8).
3.4 Defining closeness for a scale type
In this subsection, we start from a simple but general definition of closeness between single values, and then we propose a definition dependent on the scale types. Exploiting these definitions, a more intuitive description of closeness for the scale types is then derived in Lemma 1; this will be used to define closeness between assignments in the next subsections.
A first important remark is that it is possible to define different distance functions and, therefore, multiple closeness notions: some arbitrariness is unavoidable. A second remark is that the closeness between values also depends on the scale type of reference. For the nominal scale type, two values are similar when they are equal (e.g., since \(3\ne 4=4\), 4 is more similar to 4 than 3 is). For the ordinal scale type, we observe relative closeness when values are in sequence (e.g., since \(3<4<5\), 3 is more similar to 4 than to 5). For the interval scale type, two values are closer when their absolute difference is lower (e.g., since \(|3-5|<|1000-5|\), 3 is more similar to 5 than 1000 is). Notice that, due to the subsumption effect across scale types (Formulas (30) and (31)), each assertion is also valid at the higher scale types (e.g., if \(3\ne 4=4\) then \(3<4\le 4\) and \(|3-4|>|4-4|\)).
For our framework a simple definition of closeness is sufficient, based on the distance between two values \(x ,y \in {{\,{{\mathbb {R}}}\,}}\) computed as \(|x-y|\). Moreover, we are interested in comparing closeness relationships, rather than in precise closeness values. Therefore we do not define a function returning a closeness value, but simply the following relationship.
Definition 1
(Closeness) Let \(x, y, r \in {{\,{{\mathbb {R}}}\,}}\). We say that x is (strictly) closer to a reference value r than y, and we write \({x} \mathrel {{\preccurlyeq }^{r}} {y}\) (\({x} \mathrel {{\prec }^{r}} {y}\) for the strict case), if and only if:Footnote 3

$$|x - r| \le |y - r| \qquad (|x - r| < |y - r| \text{ for the strict case}).$$
We now particularize closeness for a certain scale type. We refer to the four classic scale types, i.e., Nominal (denoted with \(\mathtt {N}\) from now on), Ordinal (\(\mathtt {O}\)), Interval (\(\mathtt {I}\)), and Ratio (\(\mathtt {R}\)). There is a natural order on the scale types, going from the lowest scale type \(\mathtt {N}\) to the highest scale type \(\mathtt {R}\). This order is derived from the inclusion chain of the permissible transformation functions of the four scale types: if \({\mathcal {F}}_{\mathtt {T}}\) denotes the set of permissible transformation functions for the scale type \(\mathtt {T}\), then \({\mathcal {F}}_{\mathtt {R}}\subset {\mathcal {F}}_{\mathtt {I}}\subset {\mathcal {F}}_{\mathtt {O}}\subset {\mathcal {F}}_{\mathtt {N}}\). This allows us to write \(\mathtt {N}< \mathtt {O}< \mathtt {I} < \mathtt {R}\) and to speak of higher and lower scale types accordingly. See Appendix A for further details.
Definition 2
(Closeness for a scale type) Let \(x, y, r \in {{\,{{\mathbb {R}}}\,}}\), and let \({\mathcal {F}}_{\mathtt {T}}\) be the set of permissible transformation functions for the scale type \(\mathtt {T}\). We say that x is closer to a reference r than y for a certain scale type \(\mathtt {T}\), and we write \({x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {y}\), if and only if it is closer for at least one permissible transformation function in \({\mathcal {F}}_{\mathtt {T}}\):

$$\exists f \in {\mathcal {F}}_{\mathtt {T}} \left( {f(x)} \mathrel {{\preccurlyeq }^{f(r)}} {f(y)} \right).$$

The associated strict relationship requires strict closeness for every permissible transformation function:

$$\forall f \in {\mathcal {F}}_{\mathtt {T}} \left( {f(x)} \mathrel {{\prec }^{f(r)}} {f(y)} \right).$$
Given a fixed reference r, non-strict closeness is a binary relationship that satisfies reflexivity (\(\forall x \left( {x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {x} \right)\)), transitivity (\(\forall x,y,z \left( {x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {y} \wedge {y} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {z} \Longrightarrow {x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {z} \right)\)), and is a connex relation (\(\forall x,y \left( {x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {y} \vee {y} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {x}\right)\)). Non-strict closeness is not a total order since it is not anti-symmetric, as two different values can be equally non-strict close to the reference. The strict closeness relationship is irreflexive, transitive, asymmetric (\(\forall x,y \left( {x} \mathrel {\prec ^{r}_{\mathtt {T}}} {y} \Longrightarrow \lnot ( {y} \mathrel {\prec ^{r}_{\mathtt {T}}} {x})\right)\)), and acyclic.
Table 4 illustrates the behavior of these definitions by analyzing whether the value v is non-strictly or strictly closer to the reference 0 than 1 for each scale type \(\mathtt {T}\) (i.e., \({v} \mathrel {{\preccurlyeq ^{0}_{\mathtt {T}}}} {1}\) and \({v} \mathrel {\prec ^{0}_{\mathtt {T}}} {1}\)). Each column is associated with a value v, and each row with non-strict or strict closeness under different scale types. In other terms, we ask whether a given value v is closer to 0 than 1 for a given scale type; the upper and lower parts of the table discuss non-strict and strict closeness, respectively. For instance, looking at the first row, all values are non-strictly closer (actually, equally close) to 0 than 1 for the nominal scale type, since a permissible transformation function, i.e., a bijective function, can be found that transforms any value into any other one, including 0; looking at the fourth row, only the zero value is strictly closer to 0 than 1 for the nominal scale type. For the ordinal scale type (second row), only 2 is not non-strictly closer to 0 than 1 (because 2 is farther from 0 than 1 is for any monotonic transformation); looking at the fifth row, any value in [0, 1) is strictly closer. For the interval and ratio scale types, any value in \((-1,1)\) is strictly closer to 0 than 1. In general, the table reflects the subsumption of permissible transformation functions across scale types (Formulas (30) and (31)). The effect is that, when considering higher scale types, the number of strict closeness relationships increases, whereas the number of non-strict closeness relationships decreases.
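The strict-closeness behavior discussed above can be sketched as follows; the per-scale-type conditions are our own inference from the discussion of Table 4 (the paper's Lemma 1 and Table 5 give the authoritative characterizations), and the function name is ours:

```python
def strictly_closer(x, y, r, scale):
    """Is x strictly closer than y to the reference r at the given scale type?
    Conditions inferred from the discussion of Table 4 (hypothetical sketch)."""
    if scale == 'N':                 # only an exact match beats a non-match
        return x == r and y != r
    if scale == 'O':                 # x weakly between r and y, excluding y
        return (r <= x < y) or (y < x <= r)
    return abs(x - r) < abs(y - r)   # 'I' and 'R'

# Reproducing the claims discussed for Table 4 (reference 0, compared with 1):
print(strictly_closer(0, 1, 0, 'N'))     # True: only 0 at the nominal scale
print(strictly_closer(0.5, 1, 0, 'N'))   # False
print(strictly_closer(0.5, 1, 0, 'O'))   # True: any value in [0, 1)
print(strictly_closer(-0.5, 1, 0, 'O'))  # False
print(strictly_closer(-0.5, 1, 0, 'I'))  # True: any value in (-1, 1)
```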
This definition of (strict) closeness is general and valid for any scale type. We can instantiate it into the four scale types as shown by the following lemma; this form helps intuition and it will be useful to develop the formal proofs in the following (all the proofs are in Appendix B).
Lemma 1
(Closeness for the four scale types) Let \(x, y, r \in {{\,{{\mathbb {R}}}\,}}\). The value x is (strictly) closer to a reference r than y for each scale type \(\mathtt {T}\), respectively, if and only if the conditions in Table 5 are satisfied.
We now turn to closeness for assignments. We define two kinds of closeness (value-oriented and equivalence-oriented), whose meaning is discussed in the example in Sect. 3.7.
3.5 Value-oriented assignment closeness
On the basis of value closeness, we define a first closeness notion between assignments for a certain scale type. Consistently with Definition A.1, we denote assignments with \(\omega\), \(\omega ^{\prime }\), \(\omega _i\), etc.
Definition 3
(Value-oriented assignment closeness) Given a set of objects \({\mathcal {D}}\), an assignment \(\omega\) is value-closer to a reference assignment \(\rho\) than another assignment \(\omega ^{\prime }\) for the scale type \(\mathtt {T}\) (we write \({\omega }\mathrel {\trianglelefteq ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }}\) and we speak of Value-Oriented Assignment Closeness) if and only if, for every object in \({\mathcal {D}}\), the value assigned by \(\omega\) is closer to that of \(\rho\) than the value assigned by \(\omega ^{\prime }\) for the scale type \(\mathtt {T}\):

$$\forall d \in {\mathcal {D}} \left( {\omega (d)} \mathrel {{\preccurlyeq ^{\rho (d)}_{\mathtt {T}}}} {\omega ^{\prime }(d)} \right).$$
Moreover, we say that an assignment \(\omega\) is strictly value-closer to a reference assignment \(\rho\) than another assignment \(\omega ^{\prime }\) for the scale type \(\mathtt {T}\) (written \({\omega }\mathrel {\vartriangleleft ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }}\)) if, in addition, the closeness is strict for at least one object:

$${\omega }\mathrel {\trianglelefteq ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }} \;\wedge\; \exists d \in {\mathcal {D}} \left( {\omega (d)} \mathrel {\prec ^{\rho (d)}_{\mathtt {T}}} {\omega ^{\prime }(d)} \right).$$
Note that strict value-closeness can also be expressed by applying (4) for any scale type \(\mathtt {T}\in \{\mathtt {N}, \mathtt {O},\mathtt {I},\mathtt {R}\}\).
We use “value-closer” (and not simply “closer”) because we now define another closeness notion for assignments, before discussing an example in Sect. 3.7.
3.6 Equivalence-oriented assignment closeness
We formalise the closeness between assignments in terms of their equivalence class, rather than value correspondence.
Definition 4
(Equivalence-oriented assignment closeness) Given a set of objects \({\mathcal {D}}\), an assignment \(\omega\) is equivalence-closer to a reference \(\rho\) than another assignment \(\omega ^{\prime }\) for the scale type \(\mathtt {T}\) (we write \({\omega }\mathrel {\sqsubseteq ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }}\) and we speak of Equivalence-Oriented Assignment Closeness) if and only if for every assignment \(\omega _i^{\prime }\) in the equivalence class of \(\omega ^{\prime }\), there exists at least one assignment in the equivalence class of \(\omega\) that is value-closer to \(\rho\) for the scale type \(\mathtt {T}\):
Strict equivalence-oriented closeness (written \({\omega }\mathrel {\sqsubset ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }}\)) is defined analogously to the strict case of Definition 3.
Two assignments are closer according to Definition 3 if they tend to assign similar values, and according to Definition 4 if there exists a permissible transformation function that makes them assign similar values. In general, Definition 3 is useful for tasks like “measure temperature using the Celsius scale”; Definition 4 for tasks like “measure temperature using a scale of the interval scale type”. Referring to our abstract tasks, as we will discuss in detail in the following, the former is useful for categorization, where the assigned values are important; the latter for clustering, where cluster labels can be changed without affecting the result. The two kinds of assignment closeness lead to two different families of metrics, as discussed in Sect. 4; we first provide an intuitive example.
3.7 An example
As an example, consider the situation in Table 6, representing some assignments of three objects in \({\mathcal {D}} = \{o_1, o_2, o_3 \}\): \(\rho\) is the reference assignment, and the \(\omega _i\) are some different assignments. Now, if the scale type is \(\mathtt {N}\) or \(\mathtt {O}\), all the assignments are equivalent. Therefore, these \(\omega _i\) are all equally close to the reference \(\rho\) in terms of equivalence-closeness.
However, the situation changes when looking at value-oriented closeness, since \(\omega _2\) and \(\omega _3\) are strictly value-closer to \(\rho\) than the other assignments, given that they achieve equality for objects \(o_2\) and \(o_3\). Thus, for the \({\mathtt {N}}\) scale type:Footnote 4
If the scale type is \(\mathtt {O}\) the situation is slightly more complex. In addition to the previous relationships, now \(\omega _2\) is closer to \(\rho\) than \(\omega _3\) (\({\omega _2}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {O}}} {\omega _3}\)) since the value 11 is ordinally closer to 10 than 12 is. At the interval (or ratio) scale type, in addition, \({\omega _4}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {I}}} {\omega _1}\), and therefore (where \(\mathtt {T}\) is \(\mathtt {I}\) or \(\mathtt {R}\))
The meaningful statements of the interval scale type capture differences between assignments that are not captured at ordinal or nominal scale types.
When considering again equivalence-oriented assignment closeness, we obtain a different, and contradictory, outcome: \(\omega _1\) and \(\omega _4\) are closer to \(\rho\) than \(\omega _2\) and \(\omega _3\), i.e.,
To see why, consider that for any linear transformation applied to \(\omega _2\) or \(\omega _3\), we can define a transformation for \(\omega _1\) or \(\omega _4\) to make them closer to \(\rho\): for instance \(\omega _1^{\prime } = \omega _1 * 10\) and \(\omega _4^{\prime } = \omega _4 - 2\). Finally, if the scale type is \(\mathtt {R}\), \(\omega _1\) is equivalence-closer to \(\rho\) than \(\omega _4\), i.e.,
It is clear that which assignment to prefer depends on the scale type, the abstract task, and whether value-oriented or equivalence-oriented closeness is used.
4 Metrics: two families, eight classes
We can now turn to analysing effectiveness metrics. As can be seen in the following, the framework introduced above, based on the inclusion of closeness in measurement theory, allows for general definitions and theorems (in the next sections).
We first define some notation and provide a basic definition of metrics (in Sect. 4.1); on this basis, and exploiting the two kinds of closeness notions, namely value- and equivalence-oriented (Definitions 3 and 4, in Sects. 3.5 and 3.6, respectively), we distinguish between two families of metrics, namely value- and equivalence-oriented metrics (in Sects. 4.2 and 4.3, respectively). We then revisit the example of Sect. 3.7 (in Sect. 4.4) and classify the metrics into eight classes (in Sect. 4.5).
4.1 System outputs, gold standards, and metrics
The first step is to formalise system outputs and golds as assignments: they can be relevance judgments, assigned categories, etc. We use \(\alpha\) for golds and \(\sigma\) for system outputs (\(\alpha\) is a mnemonic for assessment, \(\sigma\) for system); thus \(\alpha , \sigma \in \varOmega\) (see Definition A.1).
Definition 5
(System output and gold) A system output \(\sigma\) or a gold \(\alpha\) is an assignment from a set of documents \({\mathcal {D}}\) to real numbers \({{\,{{\mathbb {R}}}\,}}\):

$$\sigma , \alpha : {\mathcal {D}} \longrightarrow {{\,{{\mathbb {R}}}\,}}.$$
This approach is different from the classical one by van Rijsbergen (1981), who focussed on considering retrieval effectiveness itself as a measurement, as well as from the recent proposal by Ferrante et al. (2017), who consider evaluation metrics as measurements and set out to determine whether an IR evaluation metric is a measure on an interval scale. We do not represent an effectiveness metric as a measurement. Our approach is more similar to the already cited work of Busin and Mizzaro (2013) and Maddalena and Mizzaro (2014), but with an important difference: that previous work modeled system outputs and golds as measurements, whereas this is not necessary in our approach, where they are simply assignments. This is an important simplification. We also remark that we are not the only ones to represent golds as assignments. For example, Ferrante et al. (2019, Section 4) write: “the ground-truth GT is a map which assigns a relevance degree \(rel \in REL\) to a document d with respect to a topic t”. We extend that approach by applying it to (i) system outputs and (ii) other abstract tasks beyond IR (ranking).
On the basis of Definition 5 we can represent any system output and gold. For example, when human assessors judge relevance using the usual 4-level scale Highly relevant, Relevant, Marginally relevant, Not relevant, it is common to translate the judgments into the numeric values 3, 2, 1, and 0. Turning to system outputs, of course the Retrieval Status Values are an assignment; but any ranked list of retrieved documents can easily be converted to an assignment, for example using the reciprocal of the rank. If the abstract task is not Ranking but, say, Clustering, the gold and system output will again be an assignment of numbers to documents, with the natural convention that two documents having the same value means that they are in the same cluster, according to the gold and/or the system output; similarly for Classification.
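These conversions can be sketched as follows (document identifiers and cluster labels are made up):

```python
# A ranked list becomes an assignment via, e.g., the reciprocal of the rank:
ranked = ['d3', 'd1', 'd2']  # system ranking, best first
sigma = {d: 1 / (i + 1) for i, d in enumerate(ranked)}
# sigma == {'d3': 1.0, 'd1': 0.5, 'd2': 0.333...}

# Cluster labels become an assignment where equal values mean "same cluster":
clusters = {'d1': 'sports', 'd2': 'politics', 'd3': 'sports'}
ids = {label: i for i, label in enumerate(dict.fromkeys(clusters.values()))}
sigma_clust = {d: ids[label] for d, label in clusters.items()}
# sigma_clust == {'d1': 0, 'd2': 1, 'd3': 0}
```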
On this basis we define a metric as a function that, given a system output \(\sigma\) and a gold \(\alpha\), returns a real value that depends on how close \(\sigma\) is to \(\alpha\).
Definition 6
(Metric) A metric is a function \({\mathcal {M}}: \varOmega ^2 \longrightarrow {{\,{{\mathbb {R}}}\,}}\).
We remark that some authors and metrics require a bounded codomain for \({\mathcal {M}}\): Moffat's (2013) first property is Boundedness (see Sect. 2.3), and several metrics assume values in [0, 1]. However, this is not always the case: some metrics (that we will analyze in the following), such as Utility metrics (for classification), are unbounded and assume values in \((-\infty ,+\infty )\); other metrics, like DCG (for IR) and MAE (for quantitation), assume values in \([0,+\infty )\), and Pearson and Spearman in \([-1, +1]\). We choose the most general possibility for two reasons: (i) we aim at a general framework, thus we do not want to exclude metrics, and to do so we focus on monotonicity and invariance properties; and (ii) normalizations can be applied, though this seems a technical and minor issue that we leave for future work.
However, we have defined two notions of closeness; therefore we define two families of metrics, as follows.
4.2 First family: value-oriented metrics
Value-oriented metrics quantify to what extent a system output matches the values assigned to items in the gold: the closer the system values are to the gold values, the higher the score the metric returns for the system. For instance, the values spam/non-spam assigned by a spam filter should match the values given by the human references. Given that the concept of closeness defined in the previous section depends on the scale type, the definition of a value-oriented metric is also dependent on the scale type. In addition, transforming both the system output and the gold by the same permissible transformation function should not affect the metric result. For instance, we can apply a bijective transformation from the value representing the category “spam” to the value representing the category “trash-messages” (which, being bijective, is in \({\mathcal {F}}_{\mathtt {N}}\), i.e., is a permissible transformation function for the nominal scale) as long as we do it for both the system output and the gold.
We first define the following two properties.
Property 1
(VOI, value-oriented invariance) A metric \({\mathcal {M}}\) is value-oriented invariant for the scale type \(\mathtt {T}\) if, for any reference gold \(\alpha\) and system output \(\sigma\), both of them on the same set of objects \({\mathcal {D}}\), the metric value does not change by applying the same permissible transformation to both \(\alpha\) and \(\sigma\):

$$\forall f \in {\mathcal {F}}_{\mathtt {T}} \left( {\mathcal {M}}(f \circ \sigma , f \circ \alpha ) = {\mathcal {M}}(\sigma , \alpha ) \right).$$
Property 2
(VOM, value-oriented monotonicity) A metric \({\mathcal {M}}\) is value-oriented monotonic for the scale type \(\mathtt {T}\) if, for any reference gold \(\alpha\) and system outputs \(\sigma\) and \(\sigma ^{\prime }\), all three of them on the same set of objects \({\mathcal {D}}\), it holds that if \(\sigma\) is strictly value-closer to \(\alpha\) than \(\sigma ^{\prime }\), then the metric value for \(\sigma\) has to be higher than that for \(\sigma ^{\prime }\). In formulas:

$${\sigma }\mathrel {\vartriangleleft ^{\alpha }_{\mathtt {T}}} {\sigma ^{\prime }} \Longrightarrow {\mathcal {M}}(\sigma , \alpha ) > {\mathcal {M}}(\sigma ^{\prime }, \alpha ).$$
We can now specialize Definition 6 and formally define a value-oriented metric on the basis of value-oriented assignment closeness (Definition 3) as follows.
Definition 7
(Value-oriented metric) A value-oriented evaluation metric for the scale type \(\mathtt {T}\) is a metric that satisfies the two properties VOI and VOM for the scale type \(\mathtt {T}\).
Therefore, the metrics of this family need to satisfy just two basic properties: (i) invariance to permissible transformation functions (VOI), i.e., the metric value does not change when transforming both assignments in the same permissible way, and (ii) monotonicity with respect to value-oriented closeness (VOM), i.e., if an assignment is value-closer to the gold than another, then the former has a higher metric value. Note that the two properties VOI and VOM depend on the scale type: a given function could be a metric for a scale type \(\mathtt {T}\) and not be a metric for another scale type \(\mathtt {T'}\).Footnote 5
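For instance, plain accuracy is a natural candidate value-oriented metric at the nominal scale type. The sketch below (with a made-up spam-filtering gold) illustrates VOI: relabeling both assignments with the same bijection leaves accuracy unchanged, while relabeling the system output alone does not:

```python
def accuracy(sigma, alpha):
    """Fraction of objects whose system value equals the gold value."""
    return sum(sigma[d] == alpha[d] for d in alpha) / len(alpha)

alpha = {'d1': 0, 'd2': 1, 'd3': 1, 'd4': 0}  # gold: 0 = non-spam, 1 = spam
sigma = {'d1': 0, 'd2': 1, 'd3': 0, 'd4': 0}  # system output

relabel = {0: 7, 1: 9}  # a bijection: permissible at the nominal scale type
f = lambda w: {d: relabel[v] for d, v in w.items()}

print(accuracy(sigma, alpha))        # 0.75
print(accuracy(f(sigma), f(alpha)))  # 0.75: VOI holds
print(accuracy(f(sigma), alpha))     # 0.0: accuracy is not equivalence-oriented
```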
4.3 Second family: equivalence-oriented metrics
Directly comparing the values of system outputs to the values assigned by the gold is in some cases too strict. For instance, the purpose of search engines consists of offering the most relevant documents rather than quantifying the relevance of documents: the key point is that any assignment that keeps the ranking of documents in the search engine output is equally effective. That is, any assignment in the same equivalence class for the ordinal scale type.
The metrics of the second family, the equivalence-oriented metrics, are formally defined on the basis of the following two properties.
Property 3
(EOI, equivalence-oriented invariance) A metric \({\mathcal {M}}\) is equivalence-oriented invariant for the scale type \(\mathtt {T}\) if, for any reference gold \(\alpha\) and system output \(\sigma\), both of them on the same set of objects \({\mathcal {D}}\), the metric value does not change by applying any permissible transformation to \(\sigma\). In formulas:

$$\forall f \in {\mathcal {F}}_{\mathtt {T}} \left( {\mathcal {M}}(f \circ \sigma , \alpha ) = {\mathcal {M}}(\sigma , \alpha ) \right).$$
Property 4
(EOM, equivalence-oriented monotonicity) A metric \({\mathcal {M}}\) is equivalence-oriented monotonic for the scale type \(\mathtt {T}\) if, for any reference gold \(\alpha\) and system outputs \(\sigma\) and \(\sigma ^{\prime }\), all three of them on the same set of objects \({\mathcal {D}}\), it holds that if \(\sigma\) is strictly equivalence-closer to \(\alpha\) than \(\sigma ^{\prime }\), then the metric value for \(\sigma\) has to be higher than that for \(\sigma ^{\prime }\). In formulas:

$${\sigma }\mathrel {\sqsubset ^{\alpha }_{\mathtt {T}}} {\sigma ^{\prime }} \Longrightarrow {\mathcal {M}}(\sigma , \alpha ) > {\mathcal {M}}(\sigma ^{\prime }, \alpha ).$$
We can now specialize again Definition 6 and define an equivalence-oriented metric on the basis of equivalence-oriented assignment closeness (Definition 4) as follows.
Definition 8
(Equivalence-oriented metric) An equivalence-oriented evaluation metric for the scale type \(\mathtt {T}\) is a metric that satisfies the two properties EOI and EOM for the scale type \(\mathtt {T}\).
Therefore, the metrics of this family, too, need to satisfy only two basic properties, namely invariance (EOI) and monotonicity (EOM), although these are defined in a slightly different way from the corresponding properties of the previous family of metrics (VOI and VOM): in EOI the function f is applied to \(\sigma\) only, and in EOM equivalence-oriented assignment closeness is used. As above, we will use the scale type as a subscript for the EOI and EOM properties when needed (see Footnote 5).
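As an illustration, a rank correlation such as Kendall's \(\tau\) is a natural candidate equivalence-oriented metric at the ordinal scale type. The sketch below (made-up values, no ties) shows the EOI behavior: applying a strictly increasing function, i.e., a permissible ordinal transformation, to the system output alone does not change the score:

```python
from itertools import combinations

def tau(sigma, alpha):
    """Kendall's tau between two assignments over the same objects
    (toy version: assumes no ties)."""
    pairs = list(combinations(alpha, 2))
    concordant = sum((sigma[a] - sigma[b]) * (alpha[a] - alpha[b]) > 0
                     for a, b in pairs)
    return (2 * concordant - len(pairs)) / len(pairs)

alpha = {'d1': 3, 'd2': 2, 'd3': 1, 'd4': 0}          # gold relevance grades
sigma = {'d1': 0.9, 'd2': 0.2, 'd3': 0.5, 'd4': 0.1}  # system scores

g = lambda x: x ** 3 + 1  # strictly increasing: permissible at ordinal scale
sigma_g = {d: g(v) for d, v in sigma.items()}

print(tau(sigma, alpha) == tau(sigma_g, alpha))  # True: EOI at ordinal scale
```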
4.4 The example revisited
We provide some intuition by discussing again the example in Sect. 3.7 and Table 6: here \(\rho\) is the gold (\(\alpha\) using the notation of this section) and the \(\omega _i\) are the system outputs (\(\sigma _i\)). Let us first interpret the assignments as being on the \(\mathtt {I}\) scale, and focus on \(\omega _1\) and \(\omega _2\). Since \(\omega _2\) is value-closer to \(\rho\) than \(\omega _1\) (\({\omega _2}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {I}}} {\omega _1}\), see Formula (10)), a value-oriented metric will assign a higher value to \(\omega _2\) than to \(\omega _1\) (\({\mathcal {M}}(\omega _2,\rho )>{\mathcal {M}}(\omega _1,\rho )\)), because of the VOM property. Conversely, since \(\omega _1\) is equivalence-closer to \(\rho\) than \(\omega _2\) (\({\omega _1}\mathrel {\sqsubset ^{\rho }_{\mathtt {I}}} {\omega _2}\), see Formula (11)), an equivalence-oriented metric will assign a higher value to \(\omega _1\) than to \(\omega _2\) (\({\mathcal {M}}(\omega _1,\rho )>{\mathcal {M}}(\omega _2,\rho )\)), because of the EOM property. Metrics of the first family reward \(\omega _2\) for “almost guessing” the correct \(\rho\) values; metrics of the second family reward \(\omega _1\) for better guessing the ratios between the intervals (i.e., the meaningful statements for \(\mathtt {I}\)) among the values in \(\rho\).
Indeed, \(\omega _1\)'s guess of the ratios of the intervals is not only better: it is perfect. This can be seen also by considering the EOI property: if the values in \(\omega _1\) are multiplied by ten (a permissible transformation function for \(\mathtt {I}\)) we obtain exactly \(\rho\). Since it is not possible to do better than \(\omega _1\), an equivalence-oriented metric should assign to \(\omega _1\) a value at least as high as to any other assignment. Note that the above remarks hold also when replacing \(\omega _1\) with \(\omega _4\) (apart from changing the permissible transformation function), as \(\omega _4\) is equivalent to \(\omega _1\) (simply use the transformation \(f(x) = \frac{x -2}{10}\)) and to \(\rho\) (\(f(x) = x -2\)). This also means, again because of EOI, that \({\mathcal {M}}(\omega _1,\rho )={\mathcal {M}}(\omega _4,\rho )\).
Changing the scale will in general change the situation. For example, if we now interpret the same assignments as being of the \(\mathtt {R}\) scale, we get different outcomes. On the \(\mathtt {R}\) scale, \(\omega _4\) is not equivalent to \(\omega _1\) and \(\rho\) anymore: there is no function in \({\mathcal {F}}_{\mathtt {R}}\) that maps the one into the other two. Indeed, since \(\omega _1\) is equivalence-closer to \(\rho\) than \(\omega _4\) (see Formula (12)), equivalence-oriented metrics for the ratio scale will assign a higher value to \(\omega _1\) than to \(\omega _4\) (\({\mathcal {M}}(\omega _1,\rho )>{\mathcal {M}}(\omega _4,\rho )\)), because of the EOM property.
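These checks can be mechanized. The following sketch (with hypothetical assignment values chosen to reproduce the relations just discussed, since Table 6 is not repeated here) tests whether one assignment can be mapped onto another by a permissible transformation function: a positive linear affine function for \(\mathtt {I}\), a positive scaling for \(\mathtt {R}\).

```python
# Hypothetical assignments mirroring the relations in the text (values are
# illustrative): omega1 * 10 = rho, and omega4 - 2 = rho.
rho    = [10.0, 20.0, 40.0]
omega1 = [1.0, 2.0, 4.0]       # equivalent to rho on both I and R
omega4 = [12.0, 22.0, 42.0]    # equivalent to rho on I, but not on R

def affine_equivalent(a, b, eps=1e-9):
    """True if b = s*a + t for some s > 0 (permissible for the interval scale I).
    Assumes the first two values of a differ, so s and t can be fitted on them."""
    s = (b[1] - b[0]) / (a[1] - a[0])
    t = b[0] - s * a[0]
    return s > 0 and all(abs(s * x + t - y) < eps for x, y in zip(a, b))

def ratio_equivalent(a, b, eps=1e-9):
    """True if b = s*a for some s > 0 (permissible for the ratio scale R)."""
    s = b[0] / a[0]
    return s > 0 and all(abs(s * x - y) < eps for x, y in zip(a, b))
```

On these values the checks confirm the discussion: \(\omega _1\) is equivalent to \(\rho\) at both scale types, while \(\omega _4\) is equivalent to \(\rho\) only at the interval scale type.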
4.5 Eight classes of metrics
Our framework is now complete: we have provided two definitions, one for each metric family (value-oriented and equivalence-oriented). Both definitions can be applied at different scale types: nominal, ordinal, interval, or ratio. By combining the four scale types \(\mathtt {N, O, I, R}\) with the two families of metrics we obtain the eight classes of metrics summarized in Table 7.
In the following we prove some theorems that show that:
- The basic axioms proposed in the literature for specific abstract tasks can be derived from the general metric definition, taking into account the particular combination of family and scale type. Notice that we have used the term basic axiom instead of axiom: as we have seen in Sect. 2, some axioms depend on the particular task, while other (basic) axioms are common to any task that fits into the corresponding abstract task.
- Existing abstract tasks, and the corresponding metrics, actually fit into our classification. More specifically, we show that each information access abstract task (classification, clustering, etc.) corresponds to a metric class, i.e., a particular combination of metric family and scale type, and that the metrics that fit into the same category according to these two dimensions are those used in the literature for that abstract task.
- The theoretical limitations of metrics that have been identified in the literature (i.e., metrics that do not satisfy basic axioms) can also be explained in the general framework proposed in this paper.
- By exploring the classes of metrics along the two dimensions (scale type and family of metric), it is possible to address evaluation gaps and provide formal definitions, for example, of Ordinal Classification metrics, which have not been addressed yet.
5 Properties and scale types
We start by making explicit some implication relationships between the four properties VOI, VOM, EOI, and EOM at different scale types. These are derived from the fact that permissible transformation functions are subsumed across scale types (see, in Appendix A, Sect. A.2 and in particular Formulas (30) and (31)). That is, the set of bijective functions includes the set of monotonic functions, which in turn includes the set of linear affine functions. Therefore, closeness for low scale types (e.g., nominal) implies closeness for higher scale types (e.g., interval). The resulting relationships are listed in the following lemma and in the two subsequent corollaries, with the aims of: (i) providing the basis to prove some properties in the rest of the paper, and (ii) helping to better understand the meaning of the four axioms VOI, VOM, EOI, EOM and their relationships with the four scales \(\mathtt {N, O, I, R}\).
Lemma 2
(Four properties, four scale types) The following relationships hold among the four properties (VOI, VOM, EOI, EOM) and the four scale types (\(\mathtt {N, O, I, R}\)).
(a) If a metric satisfies VOI for a certain scale type, then it satisfies VOI for higher scale types:
$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}, \mathtt {T} < \mathtt {T'} ~( \mathrm {VOI}_{\mathtt {T}}&\Longrightarrow \mathrm {VOI}_{\mathtt {T'}} ). \end{aligned}$$
(b) If a metric satisfies EOI for a certain scale type, then it satisfies EOI for higher scale types:
$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}, \mathtt {T} < \mathtt {T'} ~( \mathrm {EOI}_{\mathtt {T}}&\Longrightarrow \mathrm {EOI}_{\mathtt {T'}} ). \end{aligned}$$
(c) VOI for the nominal or ordinal scale type and VOM for higher scale types are incompatible:
$$\begin{aligned} \forall \mathtt {T} \in \{\mathtt {N, O} \}, \mathtt {T'} \in \{\mathtt {O, I, R} \}, \mathtt {T} < \mathtt {T'} \left( \lnot \left( \mathrm {VOI}_{\mathtt {T}}\wedge \mathrm {VOM}_{\mathtt {T'}}\right) \right) . \end{aligned}$$
(d) VOI for the nominal or ordinal scale type and EOM for higher scale types are incompatible:
$$\begin{aligned} \forall \mathtt {T} \in \{\mathtt {N, O} \}, \mathtt {T'} \in \{\mathtt {O, I, R} \}, \mathtt {T} < \mathtt {T'} \left( \lnot \left( \mathrm {VOI}_{\mathtt {T}}\wedge \mathrm {EOM}_{\mathtt {T'}}\right) \right) . \end{aligned}$$
(e) VOM and EOM are incompatible, whatever the scale type:
$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \} \left( \lnot ( \mathrm {VOM}_{\mathtt {T}} \wedge \mathrm {EOM}_{\mathtt {T'}}) \right) . \end{aligned}$$
(f) \(\hbox {VOM}_\mathtt {I}\) and \(\hbox {VOM}_{\mathtt {R}}\) are equivalent:Footnote 6
$$\begin{aligned} \mathrm {VOM}_\mathtt {I}&\Longleftrightarrow \mathrm {VOM}_{\mathtt {R}}. \end{aligned}$$
(g) VOM and EOI are incompatible, whatever the scale type:
$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \} \left( \lnot ( \mathrm {VOM}_{\mathtt {T}} \wedge \mathrm {EOI}_{\mathtt {T'}}) \right) . \end{aligned}$$
(h) EOM for a certain scale type and EOI for lower scale types are incompatible:
$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}, \mathtt {T'} < \mathtt {T} \left( \lnot ( \mathrm {EOM}_\mathtt {T} \wedge \mathrm {EOI}_{\mathtt {T'}}) \right) . \end{aligned}$$
From this lemma, we can infer the following corollaries. It is easy to see that items (c), (d), (e), (g), and (h) can be restated as implications.
Corollary 1
(Incompatibilities and implications) Items (c) and (d) of Lemma 2 can be restated as follows. If a metric satisfies VOI for the nominal or ordinal scale type, then it does not satisfy VOM and EOM for higher scale types; and vice versa, if a metric satisfies VOM or EOM for higher scale types, then it does not satisfy VOI for the nominal or ordinal scale type:
Item (e) of the lemma can be restated as follows. If a metric satisfies VOM for a certain scale type, then it does not satisfy EOM for any scale type, and vice-versa:
Item (g) of the lemma can be restated as follows. If a metric satisfies EOI for any scale type, then it does not satisfy VOM for any scale type, and vice-versa:
Item (h) of the lemma can be restated as follows. If a metric satisfies EOM for a certain scale type, then it does not satisfy EOI for lower scale types, and vice-versa:
Another result is that knowing that a metric fits into one metric class ensures that it does not fit into the other classes, with only one exception. The unification of value-oriented metrics for the interval and ratio scale types is due to the fact that the concepts of closeness for those two scale types are equivalent (see Table 5).
Corollary 2
(Compatibility of value-oriented interval and ratio) Under our definition of metrics, there exists only one case in which a metric can be classified in more than one class (see Table 7): the value-oriented metrics for the interval and ratio scale types (5 and 7 in the table).
6 Basic axioms in the literature
In this section we state some theorems that prove that the basic axioms proposed in the literature for classification, clustering, and ranking, i.e., GMON (Generalized Strict Monotonicity Axiom), GHC (Generalized Homogeneity/Completeness), and PRI (Priority Axiom) (see Sect. 2), can be derived in our framework. As anticipated in Sect. 4.5, we focus on the basic axioms that we have identified in Sect. 2 (i.e., the axioms marked with (*) and shown in italics in the tables in Sect. 2), leaving aside the task-dependent axioms. As noted in Sect. 2.4, there are no axiomatics, and no basic axioms, for quantitation, at least in the context of information access evaluation: it is therefore left out of this section and analyzed in Sect. 8.1.
6.1 Classification: GMON is equivalent to \(\hbox {VOM}_ \mathtt {N}\)
As discussed in Sect. 2.1, the common axiom that can be applied to any classification task is the Strict Monotonicity Axiom (MON). It states that if \(\sigma\) and \(\sigma ^{\prime }\) are two classifiers and \(\alpha\) is the ground truth, all of them over a set of documents \({\mathcal {D}}\), and if \(\sigma\) and \(\sigma ^{\prime }\) differ only on a single document \(d \in {\mathcal {D}}\), which is correctly classified by \(\sigma\) and wrongly by \(\sigma ^{\prime }\), then the metric value must be higher for \(\sigma\). More formally, if
then \({\mathcal {M}}(\sigma ,\alpha )>{\mathcal {M}}(\sigma ^{\prime },\alpha )\).
This definition requires that \(\sigma\) and \(\sigma ^{\prime }\) return the same result for all documents different from d. However, this is not strictly necessary: we can slightly generalise MON by requiring that both systems are correct, or both wrong, with respect to the gold on the same documents, except for d.
Axiom 1
(GMON, generalized strict monotonicity axiom) Let \(\alpha\), \(\sigma\), and \(\sigma ^{\prime }\) be three assignments. If every document with an error in \(\sigma\) is also an error in \(\sigma ^{\prime }\), and there exists an error in \(\sigma ^{\prime }\) which is not an error in \(\sigma\), i.e.,
then the metric value must be higher for \(\sigma\) than for \(\sigma ^{\prime }\):
When there are only two classes (only two possible values in \(\sigma\) and \(\alpha\)), GMON and MON are equivalent, given that if \(\sigma (d)\ne \alpha (d)\) and \(\sigma ^{\prime }(d)\ne \alpha (d)\) then \(\sigma (d)=\sigma ^{\prime }(d)\).
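The GMON condition itself is mechanically checkable: it holds exactly when the error set of \(\sigma\) is a strict subset of the error set of \(\sigma ^{\prime }\). A minimal sketch with illustrative assignments:

```python
def errors(sigma, alpha):
    """Set of documents misclassified by sigma with respect to the gold alpha."""
    return {d for d in alpha if sigma[d] != alpha[d]}

def gmon_prefers(sigma, sigma_p, alpha):
    """True when GMON requires M(sigma, alpha) > M(sigma_p, alpha):
    every error of sigma is an error of sigma_p, and sigma_p has at least one more."""
    return errors(sigma, alpha) < errors(sigma_p, alpha)   # strict subset

alpha   = {'d1': 'A', 'd2': 'B', 'd3': 'A'}
sigma   = {'d1': 'A', 'd2': 'B', 'd3': 'B'}   # one error (d3)
sigma_p = {'d1': 'B', 'd2': 'B', 'd3': 'B'}   # errors on d1 and d3
```

With only two classes, as noted above, the strict-subset condition coincides with MON.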
We can now prove the following theorem.
Theorem 1
(\(\hbox {VOM}_\mathtt {N}\) and GMON) The VOM property for the nominal scale type (\(\hbox {VOM}_\mathtt {N}\)) and the Generalized Strict Monotonicity (GMON) axiom are equivalent.
6.2 Clustering: GHC is equivalent to \(\hbox {EOM}_\mathtt {N}\)
As mentioned in Sect. 2.2, the basic clustering axioms Homogeneity and Completeness can be generalized into a unique GHC axiom, which can be formalized as follows.
Axiom 2
(GHC, generalized homogeneity/completeness) Let \(\alpha\), \(\sigma\), and \(\sigma ^{\prime }\) be three assignments. If (i) for each document pair, if \(\sigma ^{\prime }\) is correct then also \(\sigma\) is correct, i.e.,
and (ii) there exists at least one document pair \(d_1\), \(d_2\) such that \(\sigma\) adds to \(\sigma ^{\prime }\) a correct relation, i.e., \(\sigma\) is correct and \(\sigma ^{\prime }\) is not, i.e.,
then the metric value must be higher for \(\sigma\) than for \(\sigma ^{\prime }\):
The following theorem states that the GHC and the EOM properties for the nominal scale type are equivalent.
Theorem 2
(\(\hbox {EOM}_\mathtt {N}\) and GHC) The EOM property for the nominal scale type (\(\hbox {EOM}_\mathtt {N}\)) and the Generalized Homogeneity and Completeness (GHC) axiom are equivalent.
6.3 Ranking: PRI is equivalent to \(\hbox {EOM}_\mathtt {O}\)
We have seen in Sect. 2.3 that the basic axiom Swapping appears in most axiomatics and that it can be generalized as the Priority axiom (PRI). PRI states that swapping two contiguous documents in the ranking according to the gold necessarily increases the score. We formalize it as follows.
Axiom 3
(PRI, priority axiom) Let \(\alpha\), \(\sigma\), and \(\sigma ^{\prime }\) be three assignments such that \(d_i\) and \(d_j\) have contiguous values at scale type \(\mathtt {O}\) in both \(\sigma\) and \(\sigma ^{\prime }\). If:Footnote 7
then the metric value must be higher for \(\sigma\) than for \(\sigma ^{\prime }\):
We can prove the following theorem.
Theorem 3
(\(\hbox {EOM}_\mathtt {O}\) and PRI) The EOM property for ordinal scale type (\(\hbox {EOM}_\mathtt {O}\)) and the Priority (PRI) axiom are equivalent.
In other words, our definition of metrics captures the basic axiom for metrics in ranking tasks. We now turn to analyse the implications for specific tasks and metrics.
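As a concrete illustration of PRI, any metric satisfying \(\hbox {EOM}_\mathtt {O}\) must increase when two contiguous documents are swapped so that the one with higher gold relevance moves up; DCG, analysed in the next section, behaves this way. A minimal sketch with illustrative relevance values:

```python
import math

def dcg(ranking, rel):
    """Discounted cumulative gain of a ranked list of document ids."""
    return sum(rel[d] / math.log2(i + 2) for i, d in enumerate(ranking))

rel     = {'d1': 3, 'd2': 2, 'd3': 0}   # gold relevance values (illustrative)
sigma_p = ['d2', 'd1', 'd3']            # d1 and d2 ordered against the gold
sigma   = ['d1', 'd2', 'd3']            # the contiguous swap toward the gold
```

As PRI requires, the swap toward the gold order strictly increases the DCG score.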
7 Metrics analysis
In this section, we analyse existing metrics in terms of our theoretical framework. We have the twofold aim of: (i) showing that existing abstract tasks and corresponding metrics are explained by our framework, and (ii) that the theoretical limitations of metrics that have been identified in the literature (i.e., metrics that do not satisfy basic axioms) are captured in our framework as well. As noted in the previous section, we postpone the quantitation case to Sect. 8.1.
Table 8 summarises the analysis carried out in this section. The columns represent the metrics categorised by abstract tasks. The rows represent the properties VOI, VOM, EOI, and EOM, at different scale types. Circles indicate that properties are satisfied; black circles emphasize that both properties (from the corresponding definition of evaluation metric) are satisfied. As the table shows, only metrics in the corresponding category satisfy our definition of metric. In addition, for each abstract task category, there exist metrics which are not able to satisfy both properties. This does not mean that these metrics are absolutely useless. For instance, Purity and Inverse Purity in clustering, or P@N in IR, have the advantage of being easy to interpret. However, the evaluation results have to be analysed carefully to prevent misinterpretations of systems' quality. We will see in this section that these cases correspond to theoretical metric drawbacks previously reported in the literature, with practical implications.
7.1 Classification: value-oriented, nominal
We define classification metrics as follows.
Definition 9
(Classification metric) A classification metric is a value-oriented metric for the nominal scale type.
According to our aim (i) above, we want to show that metrics satisfying VOI and VOM for the nominal scale type are those that are used for classification problems in the literature. Most of the metrics used in classification tasks are combinations of the contingency matrix components, i.e., the amount of true and false, positive and negative samples. Formally, the contingency matrix can be defined as follows.
Definition 10
(Contingency matrix) Given the system output \(\sigma\) and the gold \(\alpha\), two functions over a finite set of values \({\mathcal {V}}\), the contingency matrix \(C:{\mathcal {V}}\times {\mathcal {V}}\rightarrow {\mathcal {N}}\) is:
We can easily prove that C(x, y) is invariant across the permissible transformation functions for the nominal scale type \({\mathcal {F}}_\mathtt {N}\) (i.e., the bijective functions, see Sect. A.2 in Appendix A) applied to \(\sigma\) and \(\alpha\). In other words, changing the names of categories without merging them (as bijective functions do) in both the gold and the system output does not affect the contingency table. That is, for any bijective function \(f_b\)
Therefore, we can state the following theorem.
Theorem 4
(\(\hbox {VOI}_{\mathtt {N}}\) and contingency matrix) Any function over the elements in the contingency matrix satisfies the VOI property for the nominal scale type (\(\hbox {VOI}_{\mathtt {N}}\)).
Table 8 includes some metrics used in classification tasks, such as Accuracy, Macro-Average Accuracy, F-measure, Odds ratio, Lam%. All of them are computed from the contingency matrix. Therefore, according to the previous theorem, they satisfy \(\hbox {VOI}_{\mathtt {N}}\). According to Lemma 2(a) they also satisfy VOI for the rest of (higher) scale types.
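The invariance stated in Theorem 4 is easy to check numerically: renaming categories with the same bijection in both the system output and the gold leaves every cell of the contingency matrix, and hence any metric computed from it such as Accuracy, unchanged. A sketch with illustrative assignments:

```python
from collections import Counter

def contingency(sigma, alpha):
    """C(x, y) = number of documents d with sigma(d) = x and alpha(d) = y."""
    return Counter((sigma[d], alpha[d]) for d in alpha)

def accuracy(C):
    """Fraction of documents on the diagonal of the contingency matrix."""
    return sum(c for (x, y), c in C.items() if x == y) / sum(C.values())

alpha = {'d1': 'A', 'd2': 'B', 'd3': 'A', 'd4': 'B'}
sigma = {'d1': 'A', 'd2': 'A', 'd3': 'A', 'd4': 'B'}

f = {'A': 'cat', 'B': 'dog'}                 # a bijective renaming of categories
alpha_f = {d: f[v] for d, v in alpha.items()}
sigma_f = {d: f[v] for d, v in sigma.items()}

C, Cf = contingency(sigma, alpha), contingency(sigma_f, alpha_f)
```

The renamed matrix has the same cells under the renamed indices, so Accuracy (and any other function of the cells) is unaffected.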
We can prove that some common metrics applied in classification satisfy \(\hbox {VOM}_{\mathtt {N}}\) (and, given Theorem 1, GMON).
Theorem 5
(\(\hbox {VOM}_{\mathtt {N}}\) and classification metrics) The metrics Accuracy, Macro Average Accuracy and Phi Correlation satisfy the VOM property for the nominal scale type (\(\hbox {VOM}_{\mathtt {N}}\)).
Given that these metrics also satisfy \(\hbox {VOI}_{\mathtt {N}}\) (according to Theorem 4), we can state that they fit into our definition of value-oriented metrics for the nominal scale type, and therefore, they are classification metrics.
We now turn to our aim (ii) above. The main theoretical drawbacks of classification metrics are related to MON. For instance, according to Sebastiani (2015), the F-measure (computed as the harmonic mean of precision and recall) does not satisfy MON when there is a zero value in one component of the contingency matrix. According to Qi et al. (2010), the classification metric Lam% has the same drawback when zero values appear in the contingency matrix. Something similar happens with the Odds ratio. This problem can be solved by considering the contingency table as a probabilistic distribution and applying smoothing techniques (Amigó et al. 2018), but this is not the focus of this paper. Given that these metrics do not satisfy MON, they do not satisfy GMON either, and therefore do not satisfy \(\hbox {VOM}_{\mathtt {N}}\). Mutual Information (MI) is another metric used in classification. However, we can assert that it does not satisfy GMON (and \(\hbox {VOM}_\mathtt {N}\)), given that according to MI an output achieves the highest score even if the label names are replaced. We will see shortly that MI is a clustering metric.
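The remark on MI can be verified directly: MI depends only on the joint distribution of system and gold labels, so bijectively renaming the system's labels leaves the score untouched even when every single document becomes misclassified. A sketch (natural logarithm, illustrative assignments):

```python
import math
from collections import Counter

def mutual_information(sigma, alpha):
    """Mutual information between system and gold label assignments (natural log)."""
    n = len(alpha)
    joint = Counter((sigma[d], alpha[d]) for d in alpha)
    ps, pa = Counter(sigma.values()), Counter(alpha.values())
    return sum(c / n * math.log((c / n) / ((ps[x] / n) * (pa[y] / n)))
               for (x, y), c in joint.items())

alpha   = {'d1': 'A', 'd2': 'A', 'd3': 'B', 'd4': 'B'}
sigma   = dict(alpha)                                        # a perfect output
swapped = {d: ('B' if v == 'A' else 'A') for d, v in alpha.items()}  # every label wrong
```

The output with all labels swapped misclassifies every document, yet obtains exactly the same (maximal) MI as the perfect output, violating GMON.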
The second and third columns of Table 8 summarize the properties satisfied by these metrics according to the implications derived from Lemma 2. In general, the main result of this analysis is that metrics used in classification tasks fit into our definition, while the main limitations of metrics such as F-measure, Lam% or Odds identified in the literature are also captured by our model.
7.2 Clustering: equivalence-oriented, nominal
We define clustering metrics as follows.
Definition 11
(Clustering metric) A clustering metric is an equivalence-oriented metric for the nominal scale type.
We aim to show that this definition actually captures the existing clustering metrics and their desirable basic constraints. We start by noting that existing clustering metrics such as Purity and Inverse Purity, BCubed Precision and Recall, Entropy and Class Entropy, F-measure, or clustering metrics based on counting, are all functions over a set partition. A partition is the standard set concept: a decomposition of a set into subsets such that their (pairwise) intersections are empty and their union is the original set. A function over a partition is any function that takes the original set and its partition as input parameters. We can now state as a theorem that all the above metrics satisfy \(\hbox {EOI}_\mathtt {N}\).
Theorem 6
(\(\hbox {EOI}_{\mathtt {N}}\) and functions over a partition) Any function over a set partition satisfies the EOI property for the nominal scale type (\(\hbox {EOI}_{\mathtt {N}}\)).
Checking whether each metric proposed in the literature satisfies \(\hbox {EOM}_{\mathtt {N}}\) (or GHC, which is equivalent according to Theorem 2) is too complex to be included in this paper. For this reason, we analyse the existing metrics by using the categorisation proposed by Amigó et al. (2009), and check to what extent each category can produce metrics that satisfy EOM and EOI at the nominal scale type.
A first category is that of counting pairs based metrics [e.g., Rand Statistic, Jaccard Coefficient, or Folkes and Mallows (Meila 2003; Halkidi et al. 2001)], which count how many pairs correspond to the gold in terms of same/different cluster: correct relationships between pairs of items increase the score. This principle matches directly the conditions of GHC, which is equivalent to \(\hbox {EOM}_{\mathtt {N}}\). On the other hand, the metric BCubed (Amigó et al. 2009) also increases with the amount of document pairs that are consistent with the gold, so it also satisfies GHC. Therefore, according to Theorem 2, we can state that metrics based on counting pairs fit into our definition of clustering metric.
A second category is that of entropy based metrics. Some examples are Entropy and Class Entropy (Wu et al. 2003), Variation of Information (Meila 2003), Mutual Information (Xu et al. 2003), or V-measure (Rosenberg and Hirschberg 2007). Proving that all those metrics satisfy EOM and EOI would require too much analysis for this paper: we focus on Entropy and Class Entropy only. Given a set of documents \({\mathcal {D}}\), a ground truth clustering \(\alpha\), and a clustering \(\sigma\), the average Entropy of clusters is computed as (Wu et al. 2003)
where the sums are over the documents \(d \in {\mathcal {D}}\), the probability P is computed by frequency counts as usual, \({\mathcal {V}}(\alpha )\) is the set of different values generated by the gold assignment \(\alpha\), and \({\mathcal {V}}(\sigma )\) is the set of values generated by the system assignment \(\sigma\). The Class Entropy is defined as
Notice that the evaluation score is inversely correlated with the entropy values. Then we can prove the following theorem.
Theorem 7
(\(\hbox {EOM}_{\mathtt {N}}\) and entropy metrics) Entropy and Class Entropy satisfy the EOM property for the nominal scale type (\(\hbox {EOM}_{\mathtt {N}}\)).
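The behaviour of Entropy can be sketched as follows, in the standard formulation of Wu et al. (2003): the entropy of the gold-class distribution inside each system cluster, weighted by cluster size (lower is better, since the evaluation score is inversely related to the entropy value). Class Entropy swaps the roles of \(\sigma\) and \(\alpha\).

```python
import math
from collections import Counter

def entropy_of_clusters(sigma, alpha):
    """Average entropy of the gold-class distribution inside each system cluster,
    weighted by cluster size; 0 means every cluster is pure."""
    n = len(alpha)
    total = 0.0
    for cluster in set(sigma.values()):
        members = [d for d in sigma if sigma[d] == cluster]
        counts = Counter(alpha[d] for d in members)
        h = -sum(c / len(members) * math.log2(c / len(members))
                 for c in counts.values())
        total += len(members) / n * h
    return total

alpha   = {'d1': 'x', 'd2': 'x', 'd3': 'y', 'd4': 'y'}
perfect = {'d1': 1, 'd2': 1, 'd3': 2, 'd4': 2}   # matches the gold grouping
mixed   = {'d1': 1, 'd2': 2, 'd3': 1, 'd4': 2}   # every cluster mixes both classes
```

The perfect clustering obtains the best (zero) entropy, while the fully mixed clustering obtains the worst value for this example.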
The conclusion is that BCubed, metrics based on counting pairs, and entropy based metrics are able to satisfy \(\hbox {EOI}_\mathtt {N}\) and \(\hbox {EOM}_\mathtt {N}\). Therefore, they fit into our definition of clustering metric. However, not all metrics used in clustering tasks fit into our definition. In particular, metrics based on set matching such as F-measure [notice that F-measure has a different meaning in the context of clustering (Amigó et al. 2009)] or Purity and Inverse Purity do not satisfy Completeness (Amigó et al. 2009). Therefore, they do not satisfy GHC and, according to Theorem 2, neither do they satisfy \(\hbox {EOM}_\mathtt {N}\). The reason is that they assume a certain correspondence between system output and gold clusters, which produces a lack of sensitivity in some cases.
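This lack of sensitivity can be exhibited in a few lines: Purity scores the degenerate clustering that puts every document in its own cluster as perfect, even though it conveys no grouping information, which is exactly the failure of Completeness noted above. A sketch:

```python
from collections import Counter

def purity(sigma, alpha):
    """Fraction of documents belonging to the majority gold class of their cluster."""
    n = len(alpha)
    score = 0
    for cluster in set(sigma.values()):
        counts = Counter(alpha[d] for d in sigma if sigma[d] == cluster)
        score += counts.most_common(1)[0][1]
    return score / n

alpha      = {'d1': 'x', 'd2': 'x', 'd3': 'y', 'd4': 'y'}
perfect    = {'d1': 1, 'd2': 1, 'd3': 2, 'd4': 2}
singletons = {'d1': 1, 'd2': 2, 'd3': 3, 'd4': 4}   # every document alone
```

Both clusterings obtain the maximum Purity score, so the metric cannot reward \(\sigma\) for adding the correct relations required by GHC.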
The fourth and fifth columns in Table 8 illustrate the properties satisfied by these metrics according to the implications derived from Lemma 2. In summary, we can say that most metrics used in clustering fit into our definition, capturing the limitations described in the literature.
As we mentioned before, there exist other axioms that are not captured by our definition, but they are task-dependent. For instance, the range of values (Meila 2003), the ability to join single clusters into a rag bag cluster (Amigó et al. 2009), or the robustness under the overweighting of big clusters due to the combinatory explosion of document pairs.
7.3 Ranking: equivalence-oriented, ordinal
We define ranking metrics as follows.
Definition 12
(Ranking metric) A ranking metric is an equivalence-oriented metric for the ordinal scale type.
Let us see that ranking metrics satisfy \(\hbox {EOI}_\mathtt {O}\) and \(\hbox {EOM}_\mathtt {O}\). Let us first consider the obvious fact that ranking metrics compare the system output ranking against the gold. By definition, a ranking is invariant across monotonic functions (the permissible transformation functions for ordinal scale types, \({\mathcal {F}}_\mathtt {O}\)). Therefore, we can state the following theorem.
Theorem 8
(\(\hbox {EOI}_{\mathtt {O}}\) and ranking metrics) Metrics that compare rankings with a gold standard satisfy the EOI property for the ordinal scale type (\(\hbox {EOI}_{\mathtt {O}}\)).
Concerning \(\hbox {EOM}_\mathtt {O}\), we have seen in Theorem 3 that it is equivalent to the Priority axiom (PRI). According to Amigó et al. (2013), many metrics applied in ranking problems actually satisfy PRI. Therefore, we can state the following theorem.
Theorem 9
(\(\hbox {EOM}_\mathtt {O}\) and ranking metrics) The metrics MAP, DCG, nDCG, RBP, ERR, and ordinal correlation coefficients such as Kendall or Spearman satisfy the EOM property for the ordinal scale type (\(\hbox {EOM}_\mathtt {O}\)).
Therefore, many of the metrics used in ranking problems actually fit our definition. This is the case for metrics such as MAP, DCG, nDCG, RBP, ERR, and also ordinal correlation coefficients such as Kendall or Spearman. However, these last two coefficients are normally not used in IR, since the collection contains a huge amount of documents that will never be explored by the user. Indeed, IR evaluation metrics, besides satisfying the Priority axiom, also give more weight to the top of the ranking returned by the system. This is captured, for example, by Moffat’s properties Convergence and Top-weightedness (2013), or by Amigó et al.’s Deepness, Closeness Threshold, and Deepness Threshold (2013) (see Table 3). These properties are not captured in our framework and have to be added explicitly if needed; this is usually the case although, as we already discussed in Sect. 2.3, one might imagine a ranking task where the top-weightedness property is undesirable: for instance, if we need to rank a set of documents and we know that the user will explore all of them anyway.
However, not every metric used in ranking satisfies the Priority axiom (Amigó et al. 2013). This is the case for metrics such as Precision at N, Recall at N, or Mean Reciprocal Rank: P@N and R@N do not consider the order of documents before position N, and MRR does not consider the order of documents after the first relevant one. Therefore, according to Theorem 3 we can infer that they do not satisfy \(\hbox {EOM}_\mathtt {O}\): they do not fit into our definition of ranking metrics. Table 8 (last two columns) illustrates the ranking metrics.
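The failure of P@N with respect to PRI can be seen with a short check: swapping two contiguous documents inside the top N leaves the score unchanged, even when the swap moves the document with higher graded relevance down. A sketch with illustrative relevance values:

```python
def precision_at_n(ranking, relevant, n):
    """Fraction of the top-n ranked documents that are (binary) relevant."""
    return sum(1 for d in ranking[:n] if d in relevant) / n

rel      = {'d1': 3, 'd2': 1, 'd3': 0}           # graded gold relevance (illustrative)
relevant = {d for d, r in rel.items() if r > 0}  # its binary view
sigma    = ['d1', 'd2', 'd3']                    # follows the gold order
sigma_p  = ['d2', 'd1', 'd3']                    # contiguous swap inside the top 2
```

PRI would require the score of sigma to be strictly higher, but P@2 cannot tell the two rankings apart.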
In summary, again, most of the metrics used in ranking problems fit into our definition, with some exceptions whose limitations have been discussed in the literature.
8 Other tasks and metrics
The previous two sections show how the classical abstract tasks (classification, clustering, and ranking) and their metrics can be modeled in our framework. In this section we aim to demonstrate the generality of the framework, showing how it adapts also to other information access tasks, by (i) analyzing the metrics for the quantitation, and (ii) showing that the framework leads to ordinal classification, which has not yet been studied from an axiomatic perspective.
8.1 Quantitation: value- and equivalence-oriented, interval and ratio
The proposed framework gives us the opportunity to model tasks at the higher scale types \(\mathtt {I}\) and \(\mathtt {R}\). We now analyze the four quantitation variants shown in Table 7.
8.1.1 Quantitation-1: Value-oriented, interval
Let us analyse the effect of applying the metric definition for the interval scale type. For instance, a value-oriented metric for the interval scale type must be invariant under linear transformations, and it must increase when every assignment is value-closer to the gold standard. Let us consider the widely used Mean Absolute Error (MAE), computed as the average difference between the \(\sigma\) and \(\alpha\) values:
Notice that the definition of this metric directly matches the definition of closeness for these scale types. This metric satisfies VOM for both interval and ratio scale types. However, as it is, it does not fit into the definition of value-oriented metric, since \(\mathrm {VOI}_{\mathtt {I}}\) (as well as \(\mathrm {VOI}_{\mathtt {R}}\)) does not hold. For instance, transforming both assignments by multiplying them by a constant factor (a permissible transformation function) affects the average difference. However, the average error always has to be expressed in terms of a unit. For instance, a MAE of 2 when measuring temperature makes no sense: one should say a MAE of 2 degrees centigrade. That is, we need to incorporate a unit \(|\alpha (d_0)-\alpha (d^{\prime }_0)|\) which depends on empirical observations over a fixed pair of objects (e.g., a degree centigrade is a hundredth of the temperature difference between melting ice and boiling water). Then, applying a transformation also implies transforming the unit, and in this way the average error is invariant for the interval scale type. We can define MAE with a reference difference as:
After this definition, we can state the following theorem.
Theorem 10
(\(\hbox {VOI}_\mathtt {I}\), \(\hbox {VOM}_\mathtt {I}\) and \({{\,\mathrm{\text {MAE}}\,}}\)) The Mean Absolute Error with a reference difference is a value-oriented metric for the interval scale type, i.e., \({{\,\mathrm{\text {MAE}}\,}}_{\mathrm {RD}}\) satisfies \(\hbox {VOI}_\mathtt {I}\) and \(\hbox {VOM}_\mathtt {I}\).
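The role of the reference difference can be verified numerically. The sketch below (function names and the reference pair are illustrative) shows that plain MAE changes under a permissible affine transformation of both assignments, while MAE normalized by the reference difference \(|\alpha (d_0)-\alpha (d^{\prime }_0)|\) does not:

```python
def mae(sigma, alpha):
    """Plain mean absolute error between two assignments."""
    return sum(abs(sigma[d] - alpha[d]) for d in alpha) / len(alpha)

def mae_rd(sigma, alpha, d0, d0p):
    """MAE expressed in units of the reference difference |alpha(d0) - alpha(d0p)|."""
    return mae(sigma, alpha) / abs(alpha[d0] - alpha[d0p])

alpha = {'d1': 0.0, 'd2': 100.0, 'd3': 50.0}
sigma = {'d1': 10.0, 'd2': 90.0, 'd3': 50.0}

g = lambda x: 1.8 * x + 32                    # a permissible transformation for I
alpha_g = {d: g(v) for d, v in alpha.items()}
sigma_g = {d: g(v) for d, v in sigma.items()}
```

Transforming both assignments (here, Celsius to Fahrenheit) rescales the plain error, but the error per reference difference stays the same.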
8.1.2 Quantitation-2: Value-oriented, ratio
In the ratio scale type, instead, we do not need a reference difference: a single reference object is enough (e.g., the meter was long defined in terms of a prototype meter bar). We can define the Mean Absolute Error with a reference unit as:
This can be applied to ratio scaled dimensions such as length or speed. Now we can prove the following theorem.
Theorem 11
(\(\hbox {VOI}_\mathtt {R}\), \(\hbox {VOM}_\mathtt {R}\) and \({{\,\mathrm{\text {MAE}}\,}}\)) The Mean Absolute Error with a reference unit is a value-oriented metric for the ratio scale type, i.e., \({{\,\mathrm{\text {MAE}}\,}}_{\mathrm {RU}}\) satisfies \(\hbox {VOI}_\mathtt {R}\) and \(\hbox {VOM}_\mathtt {R}\).
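Analogously, a quick numerical check shows that MAE expressed in units of a single reference value is invariant under the permissible scalings of the ratio scale type (a sketch; the reference object is illustrative):

```python
def mae_ru(sigma, alpha, d0):
    """Mean absolute error expressed in units of the single reference value alpha(d0)."""
    n = len(alpha)
    return sum(abs(sigma[d] - alpha[d]) for d in alpha) / (n * abs(alpha[d0]))

alpha = {'d1': 2.0, 'd2': 4.0, 'd3': 8.0}   # e.g. lengths in metres
sigma = {'d1': 2.5, 'd2': 3.5, 'd3': 8.0}

k = 100.0                                    # a permissible transformation for R
alpha_k = {d: k * v for d, v in alpha.items()}   # the same lengths in centimetres
sigma_k = {d: k * v for d, v in sigma.items()}
```

Rescaling both assignments (metres to centimetres) rescales the reference value too, so the normalized error is unchanged.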
8.1.3 Quantitation-3: Equivalence-oriented, interval
Let us consider the behaviour of an equivalence-oriented metric for the interval scale type. First, it should be invariant under linear affine transformations of the system output. In addition, the metric value must increase if, for every linear affine transformation of one assignment, we can find a transformation of the other assignment which is value-closer to the gold. This is the case of the traditional Pearson correlation coefficient, which is defined as:
We can now prove the following theorem.
Theorem 12
(\(\hbox {EOI}_\mathtt {I}\), \(\hbox {EOM}_\mathtt {I}\) and Pearson) The Pearson correlation coefficient is an equivalence-oriented metric for the interval scale type, i.e., it satisfies \(\hbox {EOI}_\mathtt {I}\) and \(\hbox {EOM}_\mathtt {I}\).
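The invariance half of Theorem 12 (\(\hbox {EOI}_\mathtt {I}\)) can be checked directly: applying a positive linear affine function to the system output leaves the coefficient unchanged. A sketch with illustrative values:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

alpha   = [1.0, 2.0, 3.0, 5.0]
sigma   = [2.0, 2.5, 3.0, 6.0]
sigma_t = [3.0 * s + 7.0 for s in sigma]   # a permissible transformation for I
```

The transformed output obtains exactly the same correlation with the gold as the original one.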
8.1.4 Quantitation-4: Equivalence-oriented, ratio
Finally, we can also find equivalence-oriented metrics at the ratio scale type. The most popular metric in this category is probably the cosine distance, defined as (where \(\overrightarrow{\sigma }=\langle \sigma (i_1),\ldots ,\sigma (i_n)\rangle\) and \(\overrightarrow{\alpha }=\langle \alpha (i_1),\ldots ,\alpha (i_n)\rangle\)):
The strength of this similarity criterion is that it is not affected by proportional transformations of the assignments. According to the state of the art, the cosine distance is a good estimator of document similarity (in this case each item is the frequency of a word in the document). We can prove the following theorem.
Theorem 13
(\(\hbox {EOI}_\mathtt {R}\), \(\hbox {EOM}_\mathtt {R}\) and cosine distance) Whenever assignment values are positive, the cosine distance is an equivalence-oriented metric for the ratio scale type, i.e., it satisfies\(\hbox {EOI}_\mathtt {R}\)and\(\hbox {EOM}_\mathtt {R}\).
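Likewise, the invariance half of Theorem 13 (\(\hbox {EOI}_\mathtt {R}\)) can be checked directly: rescaling the system output by any positive constant (the permissible transformation for \(\mathtt {R}\)) leaves the cosine unchanged. A sketch with illustrative values:

```python
import math

def cosine(xs, ys):
    """Cosine similarity between two assignment vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    nx = math.sqrt(sum(x * x for x in xs))
    ny = math.sqrt(sum(y * y for y in ys))
    return dot / (nx * ny)

alpha   = [1.0, 2.0, 2.0]
sigma   = [2.0, 3.0, 6.0]
sigma_k = [5.0 * s for s in sigma]   # a permissible transformation for R
```

Any output proportional to the gold obtains the maximal score, and rescaling an output does not change its score.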
Table 9 shows the properties satisfied by these metrics according to the above theorems and the implication relationships between properties stated in Lemma 2 (the last column of the table is discussed in the following).
8.2 Ordinal classification: Value-oriented, ordinal
In Sect. 2.3 we highlighted that the Ordinal Classification task has not yet been analysed in depth, although several evaluation campaigns match this task and several authors have analysed the most popular metrics and have made some proposals (Gaudette and Japkowicz 2009; Baccianella et al. 2009; Cardoso and Sousa 2011). Our framework leads to this problem when considering value-oriented metrics at the ordinal scale, and provides two properties to be satisfied by metrics: \(\hbox {VOI}_{\mathtt {O}}\) and \(\hbox {VOM}_{\mathtt {O}}\). The first states that the metric must be invariant under strictly increasing functions (the permissible transformation functions for the ordinal scale type) applied over both the system output and the gold: if we keep the relative order of gold and system values, the metric must return the same result. \(\hbox {VOM}_{\mathtt {O}}\) states that moving a value closer to the correct one must increase the system score.
Let us analyse the most popular metrics used in this task. On the one hand, the popular Accuracy metric is invariant at the value-oriented nominal level (\(\hbox {VOI}_{\mathtt {N}}\)) and, therefore, it is also invariant at the value-oriented ordinal level, but it does not satisfy monotonicity (\(\hbox {VOM}_{\mathtt {O}}\)). That is, it does not capture closeness to the gold value. An attempt in the literature to fill this gap is Accuracy within n (Gaudette and Japkowicz 2009), which relaxes the range of values for a response to be accepted as matching. However, this solution does not solve the monotonicity problem for larger ordinal differences. Other authors proposed to use correlation coefficients, such as Pearson, Spearman, or Kendall; in particular, non-parametric ones such as Kendall and Spearman are invariant at the ordinal scale, but they do not satisfy monotonicity, given that the maximum value of 1 can be achieved without returning the correct values. The Normalized Distance Performance Measure (NDPM) (Yao 1995) has the same behavior.
On the other hand, the Mean Absolute Error (MAE) and the Mean Squared Error (MSE) have also been applied to this problem. They satisfy monotonicity (\(\hbox {VOM}_{\mathtt {O}}\)), but at the cost of invariance, given that they take into account the interval distance between system and gold assigned values.
Let us focus on a specific example and consider two documents \(d_1\) and \(d_2\), a gold \(\alpha\), and four system outputs \(\sigma _1, \sigma _2, \sigma _3, \sigma _4\) with the values shown in Table 10. Notice that \(\sigma _1\) matches the gold exactly; \(\sigma _2\) does not hit the target values, but it keeps the correct ordering as well as the correct distance between categories (they are adjacent in this case); and \(\sigma _3\) keeps the correct ordering, but its second value moves further away from the correct result. Finally, \(\sigma _4\) reflects neither the correct order of values nor the correct value matching. Then, a metric should satisfy:
Accuracy is not able to discriminate among \(\sigma _2\), \(\sigma _3\) and \(\sigma _4\), given that all these system outputs fail on both target values. Therefore, value-oriented metrics for the nominal scale type are not adequate, since being closer to the real target should increase the score. Non-parametric correlation coefficients (Kendall, Spearman, etc.) do not discriminate among \(\sigma _1\), \(\sigma _2\) and \(\sigma _3\), given that all of them sort the documents in the correct way. The reason is that this is not a ranking problem either, so equivalence-oriented metrics for the ordinal scale type are not appropriate. The linear (Pearson) correlation coefficient, an equivalence-oriented metric at the interval scale, shows the same non-discriminating effect as Spearman. In addition, it assumes that there exists the same interval between each pair of consecutive categories. However, although we know that there are more categories between “Positive” and “Negative” than between “Positive” and “Neutral”, we cannot assert that the distance is double, as the Pearson coefficient assumes. MAE and MSE would sort the systems correctly, given that they satisfy monotonicity (\(\hbox {VOM}_{\mathtt {O}}\)). Their limitation is that they are not invariant: they require assuming a fixed value for each category, and thus predefined intervals between categories.
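The discrimination failures just discussed can be reproduced with a small sketch. The numeric encoding of the categories below is hypothetical (it does not reproduce the exact values of Table 10), and `accuracy` and `mae` are straightforward implementations of the metrics analysed above:

```python
def accuracy(sigma, alpha):
    # fraction of exact matches (value-oriented, nominal scale type)
    return sum(s == a for s, a in zip(sigma, alpha)) / len(alpha)

def mae(sigma, alpha):
    # Mean Absolute Error: assumes an interval between categories
    return sum(abs(s - a) for s, a in zip(sigma, alpha)) / len(alpha)

# Hypothetical ordinal encoding of the categories assigned to d1 and d2.
gold   = [2, 3]  # alpha
sigma1 = [2, 3]  # matches the gold exactly
sigma2 = [3, 4]  # correct order, both values shifted to the adjacent category
sigma3 = [3, 5]  # correct order, second value drifts further from the gold
sigma4 = [3, 2]  # wrong order

# Accuracy cannot discriminate among sigma2, sigma3, sigma4:
assert accuracy(sigma2, gold) == accuracy(sigma3, gold) == accuracy(sigma4, gold) == 0.0
# MAE separates sigma2 from sigma3, at the cost of assuming fixed intervals:
assert mae(sigma2, gold) < mae(sigma3, gold)  # 1.0 < 1.5
```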
We claim that in this situation one must use value-oriented metrics for the ordinal scale type. The goal consists of reducing the distance between the \(\sigma\) and \(\alpha\) values but, at the same time, the distance can be defined only in ordinal terms. The farther a prediction is from the target in ordinal terms, the more the system should be penalized. The question is how to satisfy monotonicity and invariance simultaneously. We leave this open issue as future work.
9 Conclusions and future work
9.1 Summary
In this paper, we have defined a theoretical framework that explains the nature of evaluation metrics by grounding on measurement theory. Besides exploiting the traditional measurement theory, we have also introduced the concepts of value-oriented and equivalence-oriented closeness, for all scale types (nominal, ordinal, interval, ratio).
The theoretical results derived from the framework are:
-
There is a clear correspondence between abstract tasks, metric kinds, and scale types. That is, classification, clustering, ranking, value prediction, and linear-correlation-oriented tasks can be interpreted as assignments at the nominal, ordinal, interval, and ratio scale types, in which closeness to the gold is evaluated at the value or equivalence level.
-
The definitions of value- and equivalence-oriented evaluation metrics match with the basic axioms stated in the literature for particular abstract tasks (strict monotonicity for classification, homogeneity and heterogeneity for clustering, and swapping for ranking): we only need to instantiate over the different scale types to infer these axioms.
-
The proposed framework gives a single and unified explanation for most theoretical criticisms of classification, ranking, and clustering metrics found in the literature.
-
The proposed framework explains the need for interval and ratio units (e.g., meters, degrees) when assignments are compared at the interval and ratio scale types.
-
The proposed framework explains the popularity of Pearson coefficient and cosine distance when estimating the closeness of assignments at interval and ratio scale types.
Tables 8 and 9 summarize the aggregated analysis for all abstract tasks and metrics. Metrics used in different tasks match with the corresponding kind of metric according to our definition. The discarded metrics match with metric limitations actually identified in the literature. In addition, by filling the gap of value-oriented metrics for the ordinal scale type we understand how to evaluate tasks such as semantic textual similarity, recommendation, or polarity detection.
9.2 Practical consequences
We have already mentioned that our framework is not only theoretical. Let us summarize the practical contributions of this paper, mainly addressed to the communities of task and metric designers. First, the framework gives a tool for selecting and checking the suitability of metrics. One only needs to know: (i) on what scale the system output is defined, and (ii) whether the goal consists of predicting values (i.e., classification, mean error, etc.) or relationships (i.e., clustering, ranking, or linear correlation). Second, the framework helps users to distinguish between necessary properties and properties that depend on the particular characterization of the task. The constraints found in the literature that match our basic definitions of evaluation metric are strictly necessary for the corresponding abstract task. The other constraints depend on the particular task in which the evaluation is taking place. For instance, the priority constraint is a common desirable property to be satisfied by any ranking metric for the abstract task of ranking, while top-heaviness is task-dependent, although necessary for IR. Third, the framework provides a tool for defining metrics in situations in which, nowadays, the lack of suitable metrics enforces the use of several metrics. The clearest example is the simultaneous use of Pearson and Spearman, as well as error rate metrics, when evaluating value prediction on an ordinal scale (e.g., sentiment polarity prediction).
9.3 Limits of this study and future developments
Our framework opens the door for evaluation metrics in empty theoretical spaces, such as value-oriented metrics for the ordinal scale type (which have not been proposed yet). However, it does not cover every scenario nor every metric property. The reason is that there exist particular axioms that depend on a particular task. One example is top-weightedness in the case of ranking. Another one is the ability of clustering metrics to avoid the combinatorial effect of element pairs in big clusters (pair-counting metrics fail on this). In classification, metrics can be grouped into classes depending on how random or non-informative outputs are evaluated: this also depends on the particular task. However, this is a common situation in science: the first example that comes to mind is the exclusion of Euclid's fifth postulate to define different kinds of geometry.
Moreover, the framework is based on the assumption that system outputs and golds are assignments (of numerical values). This idea does not match translation or summarization systems, which generate text. Measurement theory also has a gap regarding the generation of structures. For instance, the inference of dependency trees from a text, or the construction of a hierarchical clustering, does not fit the idea of assignment. However, these tasks are often abstracted into classification or clustering abstract tasks.
In the literature one can also find composite metrics. For instance, diversity metrics in fact consider two golds simultaneously: the ordinal relevance of documents and their redundancy, which is modeled as an assignment at the nominal scale type (information nuggets). These kinds of scenarios are not covered by the framework at the moment.
Finally, we have been focusing on the four classical scale types, but in measurement theory other scale types are proposed. Thus, adding more tasks to our framework will be straightforward as long as they can be associated with a scale type. In this respect, we plan to further discuss the role of document filtering and its relations with other tasks (Footnote 8).
Notes
To help intuition, we prefer to use “documents” instead of “items”, but let us remark that it is just a matter of terminology and all the results presented in the following are general and hold for any kind of item.
To avoid ambiguity, we use “quantitation” instead of its synonym “quantification” https://en.wikipedia.org/wiki/Quantification_(science), that in data and text mining “consists in providing an aggregate estimation (e.g., the class distribution in a classification problem) for unseen test sets, applying a model that is trained using a training set with a different data distribution” (González et al. 2017, p. 74).
It is easy to see that the last equivalence in (2) derives simply by applying (1) and observing that
$$\begin{aligned} {x} \mathrel {{\preccurlyeq }^{r}} {y}\wedge \lnot ({y} \mathrel {{\preccurlyeq }^{r}} {x})&\Longleftrightarrow |r-x|\le |r-y| \wedge \lnot (|r-y|\le |r-x|) \\&\Longleftrightarrow |r-x|\le |r-y| \wedge |r-x|<|r-y| \Longleftrightarrow |r-x|<|r-y| \Longleftrightarrow {x} \mathrel {{\prec }^{r}} {y}. \end{aligned}$$

We are slightly abusing the notation here: the meaning of Formula (9) is that \({\omega _2}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {N}}} {\omega _1}\), \({\omega _2}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {N}}} {\omega _4}\), \({\omega _3}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {N}}} {\omega _1}\), and \({\omega _3}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {N}}} {\omega _4}\). A similar remark holds for the following Formula (11).
In the following, we use the specific scale type as a subscript of the property name to specify the property for that scale type when needed. So, for example, \(\mathrm {VOI}_{\mathtt {N}}\) is the VOI property for the nominal scale type.
This property is not used anywhere in this paper; we include it for completeness.
Even when the scale type of reference is \(\mathtt {O}\), by \(\sigma (d)>\sigma (d^{\prime })\) we mean that for the \(\sigma\) assignment d is more relevant than \(d^{\prime }\).
Document filtering, i.e., discriminating relevant against irrelevant documents, is often considered as a task. The issue is probably more complex: filtering can be interpreted as: (i) a classification task (but with an implicit preference order, as one needs to know which is the positive/higher class), (ii) as a sort of “singularity” of ranking task with two levels only, for which some classification metrics are suitable, or (iii) a mixture of different tasks.
In this paper we prefer to use the term “structure” in place of the original “system” to avoid confusion, since the latter is already used for “IR system” and “system output”.
We assume that both empirical and numerical relational structures are pairs (a set with a single relationship). To be truly general, we should consider empirical and numerical relational structures having: (i) more than one relationship, (ii) relationships of different arity, and (iii) also functions besides relationships. This would mainly be a technical exercise and we refrain from doing so to avoid unnecessary technical complications.
The only case for which this would be different is the sometimes proposed absolute scale type, for which the only permissible transformation function is identity.
Note that by increasing rank (\({{\,\mathrm{rank}\,}}_{\sigma }(d)<{{\,\mathrm{rank}\,}}_{\sigma }(d^{\prime })\)) we decrease relevance (\(\sigma (d)>\sigma (d^{\prime })\)); see Footnote 12
References
Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., et al. (2015). SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on interpretability. In Proceedings of SemEval, 2015 (pp. 252–263).
Agirre, E., Banea, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Mihalcea, R., et al. (2016). Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of SemEval, 2016 (pp. 509–523).
Amigó, E., Gonzalo, J., Verdejo, F., & Spina, D. (2019). A comparison of filtering evaluation metrics based on formal constraints. Information Retrieval Journal, 22(6), 581–619.
Amigó, E., Gonzalo, J., Artiles, J., & Verdejo, F. (2009). A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4), 461–486.
Amigó, E., Gonzalo, J., & Mizzaro, S. (2014). A general account of effectiveness metrics for information tasks: retrieval, filtering, and clustering. In Proceedings of SIGIR (p. 1289). ACM.
Amigó, E., Gonzalo, J., & Mizzaro, S. (2015). A formal approach to effectiveness metrics for information access: Retrieval, filtering, and clustering. In: ECIR 2015: Advances in information retrieval (pp. 817–821).
Amigó, E., Gonzalo, J., & Verdejo, F. (2011). A comparison of evaluation metrics for document filtering. In CLEF, LNCS (Vol. 6941, pp. 38–49). Springer.
Amigó, E., Gonzalo, J., & Verdejo, F. (2013). A general evaluation measure for document organization tasks. In Proceedings of SIGIR (pp. 643–652). ACM, New York, NY, USA.
Baccianella, S., Esuli, A., & Sebastiani, F. (2009). Evaluation measures for ordinal regression. In Proceedings of the 2009 ninth international conference on intelligent systems design and applications, ISDA ’09 (pp. 283–287).
Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N., & Patti, V. (2016). Overview of the Evalita 2016 SENTIment POLarity classification task. In Proceedings of third Italian conference on computational linguistics (CLiC-it 2016).
Bollmann, P. (1984). Two axioms for evaluation measures in information retrieval. In SIGIR ’84 (pp. 233–245). Swinton: British Computer Society.
Busin, L., & Mizzaro, S. (2013). Axiometrics: An axiomatic approach to information retrieval effectiveness metrics. In Proceedings of ICTIR 2013 (pp. 22–29). ACM.
Cardoso, J. S., & Sousa, R. (2011). Measuring the performance of ordinal classification. International Journal of Pattern Recognition and Artificial Intelligence, 25(08), 1173–1195. https://doi.org/10.1142/S0218001411009093.
Della Mea, V., & Mizzaro, S. (2004). Measuring retrieval effectiveness: A new proposal and a first experimental validation. Journal of the American Society for Information Science, 55(6), 530–543.
Dom, B. (2001). An information-theoretic external cluster-validity measure. IBM Research Report. http://citeseer.ist.psu.edu/dom01informationtheoretic.html.
Ferrante, M., Ferro, N., & Maistro, M. (2015). Towards a formal framework for utility-oriented measurements of retrieval effectiveness. In Proceedings of ICTIR (pp. 21–30).
Ferrante, M., Ferro, N., & Pontarollo, S. (2017). Are IR evaluation measures on an interval scale? In Proceedings of ACM ICTIR. (pp. 67–74). ACM.
Ferrante, M., Ferro, N., & Pontarollo, S. (2019). A general theory of IR evaluation measures. IEEE Transactions on Knowledge & Data Engineering, 31(3), 409–422.
Ferri, C., Hernández-Orallo, J., & Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30(1), 27–38.
Fuhr, N. (2018). Some common mistakes in IR evaluation, and how they can be avoided. SIGIR Forum, 51(3), 32–41.
Gaudette, L., & Japkowicz, N. (2009). Evaluation methods for ordinal classification. In Y. Gao & N. Japkowicz (Eds.), Advances in Artificial Intelligence. Canadian AI 2009. Lecture Notes in Computer Science (Vol. 5549). Berlin, Heidelberg: Springer.
González, P., Castaño, A., Chawla, N. V., & Coz, J. J. D. (2017). A review on quantification learning. ACM Computing Surveys, 50(5), 74:1–74:40.
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2–3), 107–145.
Maddalena, E., & Mizzaro, S. (2014). Axiometrics: Axioms of information retrieval effectiveness metrics. In Proceedings of the sixth EVIA workshop (pp. 17–24).
Maddalena, E., Mizzaro, S., Scholer, F., & Turpin, A. (2017). On crowdsourcing relevance magnitudes for information retrieval evaluation. ACM Transactions on Information Systems (TOIS), 35(3), 19:1–19:32.
Meila, M. (2003). Comparing clusterings. In Proceedings of COLT 03.
Moffat, A. (2013). Seven numeric properties of effectiveness metrics. In AIRS’13 (pp. 1–12).
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: an integrated approach. Newark: Lawrence Erlbaum Associates.
Qi, H., Yang, M., He, X., & Li, S. (2010). Re-examination on Lam% in spam filtering. In Proceedings of SIGIR.
Reimers, N., Beyer, P., & Gurevych, I. (2016). Task-oriented intrinsic evaluation of semantic textual similarity. In Proceedings of COLING, 2016 (pp. 87–96).
Roberts, F. (1984). Measurement theory: Volume 7: With applications to decision making, utility, and the social sciences. Encyclopedia of mathematics and its applications. Cambridge: Cambridge University Press.
Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of EMNLP-CoNLL (pp. 410–420).
Rosenthal, S., Farra, N., & Nakov, P. (2017). SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of SemEval ’17. ACL.
Sebastiani, F. (2015). An axiomatically derived measure for the evaluation of classification algorithms. In Proceedings of ICTIR 2015 (pp 11–20). ACM.
Sokolova, M. (2006). Assessing invariance properties of evaluation measures. In Proceedings of NIPS’06 workshop on testing deployable learning and decision systems
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–80.
Suppes, P., & Zinnes, J. L. (1963). Basic measurement theory. Handbook of mathematical psychology (Vol. 1, pp. 3–76). New York: Wiley.
Van Rijsbergen, K. (1974). Foundation of evaluation. Journal of Documentation, 30(4), 365–373.
van Rijsbergen, K. (1981). Retrieval effectiveness (pp. 32–43). London: Butterworths. (chap 3).
Wu, W., Xiong, H., & Shekhar, S. (Eds.). (2003). Clustering and information retrieval. Alphen aan den Rijn: Kluwer.
Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In Proceedings of SIGIR (pp. 267–273). ACM.
Yao, Y. (1995). Measuring retrieval effectiveness based on user preference of documents. JASIS, 46, 133–145.
Acknowledgements
This research has been partially supported by Grants Vemodalen (TIN2015-71785-R) and MISMIS (PGC2018-096212-B-C32) from the Spanish government, as well as by the Google Research Award Axiometrics: Foundations of Evaluation Metrics in IR.
Appendices
Appendix A: Measurement theory
In this appendix, we recall some basic concepts of measurement theory. Measurement theory is the discipline that studies the theoretical foundations of the activity of measuring. It is not a new field, and its roots are usually dated back to the seminal work by Stevens (1946). We try to make this paper self-contained: we therefore omit some of the most technical issues when they are not necessary to our aims, occasionally abuse the notation, and provide some examples to help intuition. A more formal and complete account can be found in the classical works by Suppes and Zinnes (1963), Roberts (1984), or Pedhazur and Schmelkin (1991).
1.1 Assignments and measurements
In measurement theory, a measurement is defined as a function that assigns real numbers to elements of a set \({\mathcal {X}}\) of objects or events in the real world. To be a measurement, the function must be a homomorphism, i.e., it must be such that the relationships in the so-called empirical relational structure (Footnote 9) (i.e., the real world) are preserved in the numerical relational structure (i.e., the codomain of the function, the set of real numbers \({{\,{{\mathbb {R}}}\,}}\)). For example, when measuring the weight of objects, we can have in the real world a relationship R that stands for “heavier than” and an operation \(\circ\) of “combination” (considering two objects together). A measurement will assign numbers such that R corresponds to “>” and \(\circ\) to “\(+\)”, thus defining a homomorphism from the empirical relational structure (the tuple \(\langle {\mathcal {X}}, R, \circ \rangle\)) to the numerical relational structure (\(\langle {{\,{{\mathbb {R}}}\,}}, >, + \rangle\)).
For our purposes, we can start from a simpler situation: we define an assignment as a function that assigns a real number to a characteristic of an object or event.
Definition A.1
(Assignment) An assignment \(\omega \in \varOmega\) (we denote with \(\varOmega\) the set of all possible assignments) is a function that assigns values to objects in a set \({\mathcal {D}}\):
$$\omega :{\mathcal {D}}\rightarrow {{\,{{\mathbb {R}}}\,}}.$$
Then, to be a measurement, an assignment must be a homomorphism (Roberts 1984; Suppes and Zinnes 1963; Pedhazur and Schmelkin 1991).
Definition A.2
(Measurement) Let \(\langle {\mathcal {D}}, R_{{\mathcal {D}}}\rangle\) be an empirical relational structure and \(\langle {{\,{{\mathbb {R}}}\,}}, R_{{{\,{{\mathbb {R}}}\,}}}\rangle\) be a numerical relational structure (Footnote 10). A measurement \(\psi \in \varPsi\) (we denote with \(\varPsi \subset \varOmega\) the set of all possible measurements) is an assignment of objects in a set \({\mathcal {D}}\)
$$\psi :{\mathcal {D}}\rightarrow {{\,{{\mathbb {R}}}\,}}$$
that is a homomorphism between the empirical relational structure and the numerical relational structure, i.e., \(\forall x,y \in {\mathcal {D}}\left( R_{{\mathcal {D}}}(x,y) \Longrightarrow R_{{{\,{{\mathbb {R}}}\,}}}(\psi (x),\psi (y))\right)\).
This definition is similar to the classical ones that can be found in the literature (Roberts 1984, pp. 51–52; Suppes and Zinnes 1963, pp. 4–8; Pedhazur and Schmelkin 1991, p. 17); it is very general and includes different kinds of measurements. Let us see some examples to emphasize this fact.
Example A.1
Let us consider three objects \({\mathcal {D}}= \{ o_1, o_2, o_3\}\), and the measurement \(\psi _t\) of their temperature in Celsius degrees (we represent the function \(\psi _t\) as a set of pairs); then \(\psi _t = \{ \langle o_1, -5 \rangle , \langle o_2, 5 \rangle , \langle o_3, 15 \rangle \}\).
Example A.2
Let us consider three products \({\mathcal {D}}= \{ p_1, p_2, p_3\}\), and the quality measurement \(\psi _q\) into high (we use the real number 2 for this), medium (1) or low (0); then \(\psi _q = \{ \langle p_1, 0 \rangle , \langle p_2, 1 \rangle , \langle p_3, 2 \rangle \}\).
Example A.3
Let us consider three students \({\mathcal {D}}= \{ s_1, s_2, s_3\}\), and a measurement \(\psi _c\) which assigns each student to a class (foreign, local) identified with a number (1 = foreign, 2 = local): then \(\psi _c = \{ \langle s_1, 1 \rangle , \langle s_2, 2 \rangle , \langle s_3, 2 \rangle \}\).
It is important to remark that measurements are assignments (with further requirements). All the above \(\psi _t\), \(\psi _q\), and \(\psi _c\) are of course assignments; to be measurements they need to be “correct”, in order to satisfy the homomorphism requirement. For example, the assignment \(\psi ^{\prime }_t = \{ \langle o_1, 5 \rangle , \langle o_2, -5 \rangle , \langle o_3, 15 \rangle \}\) would not be a measurement when using Celsius degrees, assuming that the values in \(\psi _t\) are indeed correct.
Given a measurement, one can of course make some inferences from it and derive some further properties such as equalities, greater than / less than relationships, differences, ratios, ratios of differences, etc. These properties have been named “numerical statements” (Suppes and Zinnes 1963) or simply “statements” (Roberts 1984). We use the latter term and we denote as \(\psi ^{*}\) the set of statements that can be derived from the values assigned by \(\psi\) to the set of objects in \({\mathcal {D}}\). That is, \(\psi ^{*} = \{ x ~|~\psi \vDash x \}\). For instance, going back to our Example A.1, \(\psi _{t}^{*}\) contains statements like: \(\forall i \in \{2,3\} (\psi _t(o_i)>0)\) (“temperatures of \(o_2\) and \(o_3\) are above zero”); \(\psi _t(o_3)> \psi _t(o_2) > \psi _t(o_1)\); \(\psi _t(o_3) = 3 \cdot \psi _t(o_2)\); and \(\frac{\psi _t(o_3)-\psi _t(o_2)}{\psi _t(o_2) - \psi _t(o_1)}=1\).
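The statements in \(\psi _{t}^{*}\) listed above can be verified mechanically; here is a minimal sketch in Python (the object names are simply used as dictionary keys):

```python
# Temperature measurement psi_t from Example A.1 (Celsius degrees).
psi_t = {"o1": -5.0, "o2": 5.0, "o3": 15.0}

# "Temperatures of o2 and o3 are above zero":
assert all(psi_t[o] > 0 for o in ("o2", "o3"))
# Ordering statement:
assert psi_t["o3"] > psi_t["o2"] > psi_t["o1"]
# Ratio statement (numerically correct, although not meaningful
# for an interval scale such as Celsius):
assert psi_t["o3"] == 3 * psi_t["o2"]
# Ratio of differences (meaningful at the interval scale type):
assert (psi_t["o3"] - psi_t["o2"]) / (psi_t["o2"] - psi_t["o1"]) == 1.0
```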
Let us remark that such statements can be either “value oriented” (i.e., \(\psi _t(o_3) = 15\)) or “relationship oriented” (i.e., \(\psi _t(o_3) = 3 \cdot \psi _t(o_2)\) or \(\psi _t(o_2) < \psi _t(o_3)\)). This distinction is reconsidered and further discussed in the paper. Provided that there are no errors, all the statements derived from a measurement are numerically correct. For instance, in our Example A.1, \(\psi _t(o_3) = -3 \cdot \psi _t(o_1)\) is numerically correct, while \(\psi _t(o_3) = 4 \cdot \psi _t(o_1)\) is not. However, it makes no sense to assert that \(o_3\) is minus three times hotter than \(o_2\); whereas we can assert that there is the same difference of temperature between \(o_1\), \(o_2\) and \(o_3\). That is, the statement \(\frac{\psi _t(o_3)-\psi _t(o_2)}{\psi _t(o_2) - \psi _t(o_1)}=1\) is numerically correct and makes sense, as well as the comparison \(\psi _t(o_3)>\psi _t(o_2)\). In an analogous way, in the case of the Example A.2, it makes no sense to state that there is the same difference in quality between the products \(p_1\), \(p_2\) and \(p_3\) although it is numerically true. We only know that some products have a higher quality than other products (\(\psi _q(p_3)>\psi _q(p_2)>\psi _q(p_1)\)). Finally, in the example of students and classes, there is no ordering relationship between classes: we can only say that two students belong to the same class or not.
According to the terminology used in measurement theory, we can summarize the above observations by saying that not all statements are meaningful. Meaningful statements are modeled in measurement theory on the basis of the concepts of levels, or scales.
1.2 Scales and scale types
The concepts of scale and scale type are central in measurement theory. The most typical categorization of scales is probably the one going from Nominal (\(\mathtt {N}\), the lowest scale type), through Ordinal (\(\mathtt {O}\)) and Interval (\(\mathtt {I}\)), to Ratio (\(\mathtt {R}\), the highest scale type). In our three examples above, temperature, product quality, and class assignment are \(\mathtt {I}\), \(\mathtt {O}\), and \(\mathtt {N}\) scale types respectively. Other scales commonly used for temperature are of scale type \(\mathtt {I}\), like Fahrenheit, or \(\mathtt {R}\), like Kelvin. The choice of scale types is somehow arbitrary in the sense that there is not a single choice of which ones to define.
More formally, each scale type can be defined by means of an associated set of permissible transformation functions: each scale type corresponds to a particular set of functions.
Definition A.3
(Permissible transformation functions) The sets \({\mathcal {F}}_{\mathtt {T}}\) of permissible transformation functions for each scale type \(\mathtt {T}\in \{\mathtt {N, O, I, R}\}\) are (see also Table 11):
-
\({\mathcal {F}}_{\mathtt {N}} =\{ f ~|~ f \text { is bijective, i.e., } x=y \text { iff } f(x) = f(y)\}\);
-
\({\mathcal {F}}_{\mathtt {O}} =\{ f ~|~ f \text { is strictly increasing monotonic, i.e., } \text { if } x<y \text { then } f(x) < f(y)\}\);
-
\({\mathcal {F}}_{\mathtt {I}} =\{ f ~|~ f \text { is a positive affine transformation, i.e., } f(x) = a \cdot x+b, a>0\}\);
-
\({\mathcal {F}}_{\mathtt {R}} =\{ f ~|~ f \text { is linear, i.e., } f(x) = a \cdot x, a>0\}\).
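The subsumption among these families (discussed further below) can be illustrated with a finite-sample check. This is only a sketch: a finite sample cannot prove a property of a function, and the three example transformations are hypothetical choices from each family.

```python
samples = [-5.0, 0.0, 2.5, 7.0]

def bijective_on(f, xs):
    # nominal scale type: x == y iff f(x) == f(y)
    return all((x == y) == (f(x) == f(y)) for x in xs for y in xs)

def strictly_increasing_on(f, xs):
    # ordinal scale type: if x < y then f(x) < f(y)
    return all(f(x) < f(y) for x in xs for y in xs if x < y)

f_ratio    = lambda x: 3.0 * x        # permissible for R (and below)
f_interval = lambda x: 2.0 * x + 1.0  # permissible for I (but not R)
f_ordinal  = lambda x: x ** 3         # permissible for O (but not I)

# Every transformation permissible for a higher scale type also passes
# the checks of the lower ones:
for f in (f_ratio, f_interval, f_ordinal):
    assert strictly_increasing_on(f, samples)
    assert bijective_on(f, samples)
```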
We also write \(f(\psi )\) to denote the transformed assignment, defined by \(f(\psi )(x) = f(\psi (x))\) for every \(x \in {\mathcal {D}}\).
Given these permissible transformation functions, the potential measurements over a set of objects are grouped into equivalence classes according to the following equivalence relation.
Definition A.4
(Measurement equivalence) Two measurements \(\psi\) and \(\psi ^{\prime }\) are equivalent for the scale type \(\mathtt {T}\) if there exists a permissible transformation function \(f\in {\mathcal {F}}_\mathtt {T}\) such that \(\psi\) and \(f(\psi ^{\prime })\) are identical:
$$\psi = f(\psi ^{\prime }), \text { i.e., } \forall x \in {\mathcal {D}}\, \left( \psi (x) = f(\psi ^{\prime }(x))\right) .$$
The set of all equivalent measurements is the equivalence class \([\psi ]_\mathtt {T}\).
The basic idea is that permissible transformation functions determine whether two measurements are equivalent. For the nominal scale type, measurements which distribute objects into the same groups are equivalent, given that there exists a bijective function which transforms values from one measurement to the other. For instance, the student distribution in Example A.3 is equivalent to \(\{\langle s_1,4\rangle\), \(\langle s_2,3\rangle\), \(\langle s_3,3\rangle \}\): the bijective function \(f(x)=5-x\) transforms one measurement into the other. For the ordinal scale type, measurements which order objects in the same way are equivalent. For instance, the measurement \(\{\langle p_1,4\rangle\), \(\langle p_2,7\rangle\), \(\langle p_3,10\rangle \}\) is equivalent to the product quality measurement in Example A.2: the monotonic transformation function which confirms that the two measurements are equivalent is \(f(x)=4+3 \cdot x\).
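These equivalences can be checked directly against Definition A.4. A minimal sketch, with the finite measurements represented as dictionaries:

```python
def equivalent_under(f, psi, psi_prime):
    # Definition A.4 on finite data: psi and f(psi_prime) are identical
    return all(psi[d] == f(psi_prime[d]) for d in psi)

# Example A.3 (nominal): f(x) = 5 - x maps the student classes
# {1, 2, 2} onto the equivalent distribution {4, 3, 3}.
psi_c     = {"s1": 1, "s2": 2, "s3": 2}
psi_c_alt = {"s1": 4, "s2": 3, "s3": 3}
assert equivalent_under(lambda x: 5 - x, psi_c_alt, psi_c)

# Example A.2 (ordinal): f(x) = 4 + 3*x maps the quality values
# {0, 1, 2} onto the equivalent measurement {4, 7, 10}.
psi_q     = {"p1": 0, "p2": 1, "p3": 2}
psi_q_alt = {"p1": 4, "p2": 7, "p3": 10}
assert equivalent_under(lambda x: 4 + 3 * x, psi_q_alt, psi_q)
```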
These examples illustrate two characteristics of measurement scale types. The first one is that they are additional features of measurements. That is, two measurements defined on different scale types could assign the same numbers to objects, and therefore, the derived statements are identical, but the set of meaningful statements would not be the same.
The second characteristic is that there exists a natural subsumption across measurement scale types. For instance, equality relations are meaningful in every scale type. Greater-than and less-than relations make sense in every scale type except \(\mathtt {N}\). The difference between two values makes sense neither in \(\mathtt {N}\) nor in \(\mathtt {O}\). In other words, the higher the scale type, the more meaningful statements we can make. That is, the permissible transformation functions for higher measurement scale types (e.g., ratio or interval) are also permissible transformation functions for lower measurement scale types (e.g., ordinal or nominal):
$${\mathcal {F}}_{\mathtt {R}} \subset {\mathcal {F}}_{\mathtt {I}} \subset {\mathcal {F}}_{\mathtt {O}} \subset {\mathcal {F}}_{\mathtt {N}}.$$
Therefore, there exists also a subsumption between equivalence of measurements across scale types. That is, two equivalent measurements for a high scale type are also equivalent for lower scale types:
We denote the ordering relationship on the set of the four scale types as
and we speak of higher and lower scale types accordingly.
We briefly mention a property that is further discussed in the rest of the paper: there is a dependence between abstract tasks and scale types. Roughly, classification and clustering are based on the nominal scale type; ranking on the ordinal; and quantitation on the interval and ratio. However, this dependence is not clear-cut, as we discuss in more detail in the paper.
1.3 From assignments to measurements and back
Let us remark that although measurement theory is focused on measurements, some parts of it can be immediately generalized to assignments. More specifically, the notions of equivalent measurements and of the equivalence class \([\psi ]_\mathtt {T}\) of a measurement \(\psi\) of Definition A.4 can be generalized to assignments in a straightforward way. Thus, given an assignment, it is possible to speak of an equivalent assignment for a given scale type \(\mathtt {T}\). The following definition formalizes the simple generalization of Definition A.4.
Definition A.5
(Assignment equivalence) Two assignments \(\omega\) and \(\omega ^{\prime }\) are equivalent for the scale type \(\mathtt {T}\) if there exists a permissible transformation function \(f\in {\mathcal {F}}_\mathtt {T}\) such that \(\omega\) and \(f(\omega ^{\prime })\) are identical:
The set of all equivalent assignments is the equivalence class \([\omega ]_\mathtt {T}\).
The reason behind this definition is the following: if an assignment is not a measurement, then there is no homomorphism between the empirical and numerical relational spaces, i.e., the assignment values are not correct w.r.t. the “real situation” (since measurement theory is interested in correct measurements, these cases are usually not taken into account); but Definitions A.4 and A.5 make no reference to the homomorphism, so they apply to assignments as well.
1.4 Meaningfulness
We can now address the last element of measurement theory of interest here, namely the so-called meaningfulness problem (Suppes and Zinnes 1963): as we have seen, given a measurement \(\psi\), not all the statements in \(\psi ^{*}\) (sometimes not even some of those in \(\psi\)) are meaningful.
Commonly, measurement scale types are described in terms of what statements can be made about two or more objects: the nominal scale type focuses on equality relationships, the ordinal scale type incorporates the greater than and less than relationships between values, the interval scale type adds ratios of differences, and the ratio scale type includes ratios. Suppes and Zinnes (1963) define a “meaningful numerical statement” as a numerical statement which is constant under permissible transformation functions, that is, under any equivalent measurement. In this paper, we refer to this as a Meaningful Statement.
Definition A.6
(Meaningful Statement) Given a measurement \(\psi\) of scale type \(\mathtt {T}\), the set of its Meaningful Statements, denoted as \(\widehat{\psi ^{*}_\mathtt {T}}\), contains the statements that are invariant across equivalent measurements:
Figure 2 shows how equivalence and meaningfulness are related to each other. At the top left of the figure there is the measurement \(\psi\). On its right there are its equivalent measurements (\(\psi _i\)) on the same domain and for the same scale type \(\mathtt {T}\). All of them belong to the equivalence class \([\psi ]_\mathtt {T}\). For instance, they could be different measurements of the ordinal scale type which keep the same order across object values. Each measurement produces different derived statements (\(\psi ^{*}, \psi _i^{*}\)), but the meaningful statements for the scale type \(\mathtt {T}\) are those shared by the derived statements of every equivalent measurement.
According to this definition, we can infer certain meaningful statements for each scale type. Figure 3 shows a graphical representation of some meaningful statements for each scale type. As the figure shows, there is a subsumption across scale types, with equality being the most basic relationship, available already at the nominal scale type. It is important to remark that those listed in the figure are not all the possible meaningful statements. For example, one could compute the arithmetic mean of some temperatures in Example A.1 as \({{\,\mathrm{mean}\,}}(\psi _t(o_1),\psi _t(o_2))=0\), \({{\,\mathrm{mean}\,}}(\psi _t(o_2),\psi _t(o_3))=10\) and state that the difference between the two means is the same as \(\psi _t(o_2)-\psi _t(o_1)\); this would be true under any permissible transformation function in \({\mathcal {F}}_{\mathtt {I}}\), i.e., meaningful.
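Definition A.6 also suggests a numerical probe of meaningfulness: sample permissible transformations and check whether the truth value of a statement survives. The sketch below does exactly that; the parametric families of transformations and the helper names are our own illustrative choices, so it is a sampled check under stated assumptions, not a proof:

```python
import random

def random_permissible(scale, rng):
    """Draw one permissible transformation for the given scale type.
    For N we use the decreasing bijections -a*x + b, one convenient family."""
    a = rng.uniform(0.5, 3.0)       # positive slope
    b = rng.uniform(-10.0, 10.0)
    if scale == 'R':
        return lambda x: a * x                 # similarity, a > 0
    if scale == 'I':
        return lambda x: a * x + b             # positive affine
    if scale == 'O':
        return lambda x: x ** 3 + a * x + b    # strictly increasing
    if scale == 'N':
        return lambda x: -a * x + b            # injective but not increasing
    raise ValueError(scale)

def seems_meaningful(statement, values, scale, trials=200, seed=0):
    """True iff the statement's truth value survives `trials` sampled
    permissible transformations of the given scale type."""
    rng = random.Random(seed)
    ref = statement(values)
    for _ in range(trials):
        f = random_permissible(scale, rng)
        if statement([f(v) for v in values]) != ref:
            return False
    return True

greater   = lambda v: v[0] > v[1]                                 # meaningful from O upwards
ratio_is2 = lambda v: abs(v[0] / v[1] - 2) < 1e-9                 # meaningful only at R
equidiff  = lambda v: abs((v[0] - v[1]) - (v[1] - v[2])) < 1e-9   # meaningful from I upwards
```

As expected, the ordering statement survives ordinal (and higher) transformations but not the nominal bijections, the ratio statement survives only similarity transformations, and the equality-of-differences statement survives affine but not merely monotone transformations.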
The most important aspect of this figure is the grey area that highlights how some statements for certain scale types are not meaningful. For instance, the ordinal relationship between values (\(a>b\)) is not invariant across equivalent measurements of the nominal scale type. Therefore it is hidden by the grey area.
As shown in the last column of Table 11, equality is a boolean 2-ary meaningful relationship for the \(\mathtt {N}\) scale type; greater than (\(\ge\)) is a meaningful relationship for the \(\mathtt {O}\) scale type; the ratio of differences is a 5-ary meaningful relationship for the \(\mathtt {I}\) scale type; and the ratio is a 3-ary meaningful relationship for the \(\mathtt {R}\) scale type.
It is also important to remark that, perhaps surprising at first but obvious in hindsight, for any measurement \(\psi\), the assignments of single values in \(\psi\) (e.g., \(\psi (d_1)=6\)) are never meaningful:Footnote 11 none of them is invariant under the corresponding permissible transformation functions. In other terms, to find something meaningful one needs to look inside the set of derived numerical statements \(\psi ^{*}\). For example, \(\psi _t(d_1) =-5\) of Example A.1 becomes false when changing to Fahrenheit, although “the real temperatures are always the same”. This statement captures something that is not “in the empirical relational structure” (Suppes and Zinnes 1963), and does not represent properties of the “true” temperatures; rather, it is an artefact of a particular measurement approach.
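This point can be illustrated with the temperatures of Example A.1 (from the means quoted above, and assuming \(d_1 = o_1\), the Celsius values are \(-5\), 5, and 15): the single-value statement fails under the change to Fahrenheit, while a statement about equality of differences survives. A short sketch:

```python
def c_to_f(c):
    """Celsius to Fahrenheit: a permissible interval transformation f(x) = 9x/5 + 32."""
    return 9 * c / 5 + 32

celsius = {'o1': -5.0, 'o2': 5.0, 'o3': 15.0}   # values inferred from Example A.1
fahrenheit = {o: c_to_f(v) for o, v in celsius.items()}

# A single-value statement: true for this measurement, false for an
# equivalent one, hence never meaningful.
is_minus_five = lambda m: m['o1'] == -5.0

# Equality of differences: invariant under any positive affine
# transformation, hence meaningful at the interval scale type.
equal_gaps = lambda m: (m['o2'] - m['o1']) == (m['o3'] - m['o2'])
```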
It is also important to understand that meaningfulness is not always an important, or even desired, property. If one is given the task “measure temperature using the Celsius scale”, then providing an equivalent measurement in Fahrenheit is not a satisfactory achievement of the task. This issue is also discussed in the paper.
Appendix B: Proofs
Proof of Lemma 1
We need to prove that a value \(x \in {{\,{{\mathbb {R}}}\,}}\) is (non-strictly) closer to a reference r than y for each scale type \(\mathtt {T}\) if and only if the corresponding conditions in Table 5 are satisfied.
According to Definition 2, a value \(x \in {{\,{{\mathbb {R}}}\,}}\) is (non-strictly) closer to a reference r than y for a certain scale type \(\mathtt {T}\) if there exists a permissible transformation function in \({\mathcal {F}}_{\mathtt {T}}\) such that it is (non-strictly) closer. We prove each scale type independently.
Let us consider the nominal case and non-strict closeness. By contraposition:
Therefore, for any nominal transformation function applied over r, x and y, given that they preserve the equalities and inequalities:
and therefore
Let us consider the other direction:
First, note that:
If \(r\ne y\) we can find a permissible transformation function such that \(f(r)\ge f(x) > f(y)\). If \(r=x=y\), then the identity transformation also satisfies \(f(r)\ge f(x) \ge f(y)\). Therefore, \({x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {N}}}} {y}\).
The strict case can be inferred starting from the definition in Formula (4) and by exploiting the non-strict case:
Let us turn to the ordinal case. We have to prove that:
We start by observing that:
In all these three cases, we can define a permissible transformation function for the ordinal scale type such that:
Therefore:
On the other hand, if \(\lnot (\lnot (r\ge y> x \vee x> y \ge r))\), then \(r\ge y > x\) or \(x> y \ge r\). Given that in both cases y is strictly closer to r than x, we can not find a permissible ordinal transformation function such that \(|f(r)-f(x)|<|f(r)-f(y)|\), and therefore \(\lnot ({x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {O}}}} {y})\). Then, by contraposition:
The strict case is easily derived, again starting from (4) and by exploiting the non-strict case:
Finally, regarding the scale types \(\mathtt {I}\) and \(\mathtt {R}\), we want to prove that:
We can prove that the condition on the right hand side is invariant under the permissible transformation functions for the ratio scale type:
Therefore, the condition is valid for the \(\mathtt {R}\) scale type. In addition, we know that adding a constant does not affect the absolute difference \((|r-x|=|r+b-(x+b)|)\). Therefore the condition is invariant also for the interval scale type:
Therefore
and
The proof for the strict case is immediate. \(\square\)
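The per-scale-type conditions used throughout this proof can be encoded directly. The predicates below are reconstructed from the proof (Table 5 itself is not reproduced here), so they should be read as an illustration rather than the paper's definitions:

```python
def closer(x, y, r, scale):
    """Non-strict closeness: x is at least as close to the reference r as y,
    per the per-scale-type conditions reconstructed from the proof of Lemma 1."""
    if scale == 'N':
        return x == r or y != r            # only exact matches count
    if scale == 'O':
        # y must not lie strictly between x and the reference.
        return not (r >= y > x or x > y >= r)
    if scale in ('I', 'R'):
        return abs(r - x) <= abs(r - y)    # plain absolute difference
    raise ValueError(scale)

def strictly_closer(x, y, r, scale):
    """Strict closeness as in Formula (4): closer in one direction only."""
    return closer(x, y, r, scale) and not closer(y, x, r, scale)
```

Note how the nominal predicate makes any two non-matching values mutually non-strictly closer, which is exactly the situation exploited in the contraposition step above.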
Proof of Lemma 2
We prove each item in the lemma independently, with the exception of items (c) and (d) that are proved together.
- (a) \(\mathrm {VOI}_{\mathtt {T}}\) states the property in Formula (13) for any \(f\in {\mathcal {F}}_{\mathtt {T}}\). Obviously, the same property holds for f in any subset of \({\mathcal {F}}_{\mathtt {T}}\), and if \(\mathtt {T} < \mathtt {T'}\) then \({\mathcal {F}}_{\mathtt {T'}} \subset {\mathcal {F}}_{\mathtt {T}}\) (see Formulas (30) and (31)). In other terms, if a metric value does not change by applying any transformation function of a certain scale type, then the metric is also invariant for transformation functions of higher scale types: satisfying VOI for a certain scale type implies satisfying VOI for higher scale types.
- (b) The proof for EOI is analogous to that for VOI: see the previous case (a) and start from Formula (15).
- (c) This item is proved together with the next one.
- (d) The proof is divided into two parts. First, we focus on the nominal scale type and we prove that \(\mathrm {VOI}_\mathtt {N}\) is incompatible with both VOM (item (c) of the lemma) and EOM (item (d) of the lemma) at higher scale types. Let us assume that \(\mathrm {VOI}_\mathtt {N}\) holds. Then, by reduction to absurdity, we prove that if VOM or EOM holds we reach a contradiction, and therefore they are incompatible. Let us consider the following particular assignments for \({\mathcal {D}} = \{d_1, d_2, d_3\}\):
$$\begin{aligned} \sigma&=(1,2,3)\\ \sigma ^{\prime }&=(6,5,4)\\ \alpha&=(10,20,30). \end{aligned}$$

Note that if we apply the nominal permissible transformation function \(x \mapsto f(x)\) defined in Table 12, then for all \(d \in {\mathcal {D}}\) we have \(f(\sigma (d))=\sigma ^{\prime }(d)\), \(f(\sigma ^{\prime }(d))=\sigma (d)\), and \(f(\alpha (d))=\alpha (d)\). Now, if a metric satisfies \(\mathrm {VOI}_\mathtt {N}\), this implies that the metric must be invariant under any nominal transformation \(f \in {\mathcal {F}}_\mathtt {N}\) applied over both the system output and the gold. Therefore:
$$\begin{aligned} {\mathcal {M}}(\sigma ,\alpha )={\mathcal {M}}(f(\sigma ),f(\alpha ))= {\mathcal {M}}(\sigma ^{\prime },\alpha ) \end{aligned}$$ (32)

(when clear from the context we omit the subscript on the metric). On the other hand,
$$\begin{aligned} \forall d \in {\mathcal {D}} \left( \sigma (d)<\sigma ^{\prime }(d)<\alpha (d)\right) . \end{aligned}$$ (33)

Therefore, according to the closeness definition at the ordinal, interval, and ratio scale types, for any \(d \in {\mathcal {D}}\), \(\sigma ^{\prime }(d)\) is closer to \(\alpha (d)\) than \(\sigma (d)\), i.e., \(\forall \mathtt {T} \in \{\mathtt {O, I, R}\} ({\sigma ^{\prime }}\mathrel {\vartriangleleft ^{\alpha }_{\mathtt {T}}} {\sigma })\). Now, by reduction to absurdity, if we assume that VOM at the \(\mathtt {O}\), \(\mathtt {I}\), or \(\mathtt {R}\) scale types holds, we can derive that
$$\begin{aligned} {\mathcal {M}}(\sigma ^{\prime },\alpha )>{\mathcal {M}}(\sigma ,\alpha ) \end{aligned}$$ (34)

(see Formula (14)), which contradicts (32). Therefore VOM can not be satisfied at scale types higher than nominal. This proves item (c).
Now, let us prove that EOM can not be satisfied at scale types higher than nominal. We can assert that \(\sigma ^{\prime }(d_1)>\sigma ^{\prime }(d_2)>\sigma ^{\prime }(d_3)\) and \(\alpha (d_1)<\alpha (d_2)<\alpha (d_3)\). Therefore, \(\sigma ^{\prime }\) and \(\alpha\) are not equivalent at the \(\mathtt {O}\), \(\mathtt {I}\), or \(\mathtt {R}\) scale types. This implies that the assignment \(\alpha\) is strictly equivalence-closer to \(\alpha\) than \(\sigma ^{\prime }\) is. On the other hand, the transformation function \(f(x)=10 \cdot x\), which is permissible at the ratio, interval, and ordinal scale types, makes \(f(\sigma (i))=\alpha (i)\).
Therefore, for any transformation \(f'(\sigma ^{\prime })\) at \(\mathtt {O}\), \(\mathtt {I}\), or \(\mathtt {R}\), there exists a transformation \(f(\sigma )=\alpha\) that is value-closer to \(\alpha\) than \(f'(\sigma ^{\prime })\), whereas the converse does not hold: according to Formulas (7) and (8), \(\forall \mathtt {T} \in \{\mathtt {O, I, R}\} ( {\sigma }\mathrel {\sqsubset ^{\alpha }_{\mathtt {T}}} {\sigma ^{\prime }} )\). Therefore, according to EOM at \(\mathtt {O}\), \(\mathtt {I}\), or \(\mathtt {R}\) (see Formula (16)),
$$\begin{aligned} {\mathcal {M}}(\sigma ,\alpha )>{\mathcal {M}}(\sigma ^{\prime },\alpha ), \end{aligned}$$ (35)

which again contradicts (32): also EOM can not be satisfied at scale types higher than nominal. Now we turn to the second part of the proof, which has a similar structure: we focus on the ordinal scale type and we prove, again by reduction to absurdity, that \(\mathrm {VOI}_\mathtt {O}\) is incompatible with both VOM and EOM at higher scale types. Let us consider the following particular assignments:
$$\begin{aligned} \sigma&=(1,2,3)\\ \sigma ^{\prime }&=(10,100,1000)\\ \alpha&=(10000,20000,30000) . \end{aligned}$$

Note that if we apply the ordinal permissible transformation function \(x \mapsto f(x)\) in Table 13, then \(f(\sigma )=\sigma ^{\prime }\) and \(f(\alpha )=\alpha\). Now, satisfying \(\mathrm {VOI}_\mathtt {O}\) implies that the metric must be invariant under any ordinal permissible transformation function \(f \in {\mathcal {F}}_\mathtt {O}\) applied over both the system output and the gold, i.e., Formula (32) must hold for any such f. On the other hand, (33) still holds. Therefore, according to the closeness definition at the interval and ratio scale types, for any \(d \in {\mathcal {D}}\), \(\sigma ^{\prime }(d)\) is closer to \(\alpha (d)\) than \(\sigma (d)\), i.e., \(\forall \mathtt {T} \in \{\mathtt {I, R}\} ({\sigma ^{\prime }}\mathrel {\vartriangleleft ^{\alpha }_{\mathtt {T}}} {\sigma })\). Now, if by reduction to absurdity we assume that VOM at the \(\mathtt {I}\) or \(\mathtt {R}\) scale types holds, we can derive (34) (see Formula (14)), which contradicts (32): VOM can not be satisfied at scale types higher than ordinal. In addition, the transformation function \(f(x)=10000 \cdot x\), which is permissible at the interval and ratio scale types, makes \(f(\sigma )=\alpha\). And given that \(\sigma ^{\prime }\) and \(\alpha\) are not linearly correlated, since
$$\begin{aligned} \sigma ^{\prime }(i)=10^{(\alpha (i)/10000)}, \end{aligned}$$

we can assert that \(\sigma ^{\prime }\) and \(\alpha\) are not equivalent at the \(\mathtt {I}\) or \(\mathtt {R}\) scale types. Therefore, for any transformation \(f'(\sigma ^{\prime })\) at \(\mathtt {I}\) or \(\mathtt {R}\), there exists a transformation \(f(\sigma )=\alpha\) that is value-closer to \(\alpha\) than \(f'(\sigma ^{\prime })\), whereas the converse does not hold: according to Formulas (7) and (8), \(\forall \mathtt {T} \in \{\mathtt {I, R}\} ( {\sigma }\mathrel {\sqsubset ^{\alpha }_{\mathtt {T}}} {\sigma ^{\prime }} )\). Therefore, according to EOM at \(\mathtt {I}\) or \(\mathtt {R}\) (see Formula (16)), we have that (35) holds, which again contradicts (32): also EOM can not be satisfied at scale types higher than \(\mathtt {O}\).
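The first counterexample can be replayed with any concrete metric satisfying \(\mathrm {VOI}_\mathtt {N}\); exact-match accuracy is used below as an illustrative (assumed) choice, together with the nominal transformation of Table 12:

```python
def accuracy(sigma, alpha):
    """Fraction of items whose assigned value equals the gold value; invariant
    under any injective f applied to both assignments, hence it satisfies VOI at N."""
    return sum(sigma[d] == alpha[d] for d in sigma) / len(sigma)

sigma      = {'d1': 1, 'd2': 2, 'd3': 3}
sigma_prim = {'d1': 6, 'd2': 5, 'd3': 4}
alpha      = {'d1': 10, 'd2': 20, 'd3': 30}

# The nominal permissible transformation of Table 12: it swaps sigma with
# sigma_prim and leaves alpha fixed.
f = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1, 10: 10, 20: 20, 30: 30}
transform = lambda m: {d: f[v] for d, v in m.items()}
```

\(\mathrm {VOI}_\mathtt {N}\) forces the two accuracy scores to coincide, although \(\sigma ^{\prime }\) is pointwise strictly closer to \(\alpha\) at the ordinal scale type, so VOM at \(\mathtt {O}\) would require a strict increase: the contradiction of the proof, observed numerically.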
- (e) The proof is again by reduction to absurdity. Let us consider the following assignments:
$$\begin{aligned} \sigma&=(1,2,3)\\ \sigma ^{\prime }&=(10,10,30)\\ \alpha&=(10,20,30) . \end{aligned}$$

\(f(x)=10 \cdot x\) is a permissible transformation function for every scale type. Given that \(\alpha (i)=\sigma (i) \cdot 10\), we can say that \(\sigma\) and \(\alpha\) are equivalent at every scale type, whereas there does not exist any bijective relationship between \(\sigma ^{\prime }\) and \(\alpha\), i.e., there does not exist any nominal permissible transformation function between \(\sigma ^{\prime }\) and \(\alpha\), and therefore no permissible transformation at any scale type. It follows that, for any transformation of \(\sigma ^{\prime }\) at any scale type, the transformation \(\sigma (i) \cdot 10\) is value-closer to \(\alpha\). Thus, for every scale type, \(\sigma\) is equivalence-closer to \(\alpha\) than \(\sigma ^{\prime }\). Then, if EOM is satisfied by \({\mathcal {M}}\) at any scale type, we obtain:
$$\begin{aligned} {\mathcal {M}}(\sigma ,\alpha )>{\mathcal {M}}(\sigma ^{\prime },\alpha ). \end{aligned}$$ (36)

On the other hand, \(\sigma ^{\prime }\) is strictly value-closer to \(\alpha\) than \(\sigma\) at the nominal scale type, given that \(\sigma (d_1)\ne \sigma ^{\prime }(d_1)=\alpha (d_1)\), \(\sigma (d_2)\ne \sigma ^{\prime }(d_2)\ne \alpha (d_2)\), and \(\sigma (d_3)\ne \sigma ^{\prime }(d_3)=\alpha (d_3)\). Then, by subsumption of scale types, we can say that \(\sigma ^{\prime }\) is strictly value-closer to \(\alpha\) than \(\sigma\) at every scale type. Therefore, if \({\mathcal {M}}\) satisfies VOM at any scale type then:
$$\begin{aligned} {\mathcal {M}}(\sigma ,\alpha )<{\mathcal {M}}(\sigma ^{\prime },\alpha ), \end{aligned}$$

which contradicts the EOM implication (36).
- (f) It is enough to consider that, according to Lemma 1, the conditions for closeness and strict closeness at the interval and ratio scale types are equivalent.
- (g) The proof is by reduction to absurdity. Let us consider the following assignments:
$$\begin{aligned} \sigma&=(1,2,3)\\ \sigma ^{\prime }&=(10,20,30)\\ \alpha&=(10,20,30) . \end{aligned}$$

We note that two equal values are strictly closer than two different values for every scale type. Therefore, \({\sigma ^{\prime }}\mathrel {\vartriangleleft ^{\alpha }_{\mathtt {T}}} {\sigma }\) for every scale type \(\mathtt {T}\). Therefore, according to \(\mathrm {VOM}_{\mathtt {T}}\):
$$\begin{aligned} {\mathcal {M}}(\sigma ^{\prime },\alpha )&<{\mathcal {M}}(\sigma ,\alpha ). \end{aligned}$$

On the other hand, \(\sigma\) and \(\sigma ^{\prime }\) are equivalent at the ratio scale type, given that the ratio permissible function \(f(x)=10 \cdot x\) yields \(f(\sigma (i))=\sigma ^{\prime }(i)\). By subsumption across scale types, \(\sigma\) and \(\sigma ^{\prime }\) are then equivalent at every scale type. Therefore, according to \(\mathrm {EOI}_\mathtt {T}\) at every scale type:
$$\begin{aligned} {\mathcal {M}}(\sigma ^{\prime },\alpha )&={\mathcal {M}}(\sigma ,\alpha ), \end{aligned}$$

which contradicts the previous result.
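Item (g) can also be observed on a concrete metric: Pearson correlation is invariant under positive linear transformations of the system output, so it satisfies EOI at the interval scale type, and on the assignments above it cannot reward \(\sigma ^{\prime }=\alpha\) over \(\sigma\). Pearson is our illustrative choice here, not a metric discussed in the proof:

```python
def pearson(u, v):
    """Pearson correlation; invariant under positive linear transformations
    of either argument, so as a metric it satisfies EOI at the interval scale type."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = sum((x - mu) ** 2 for x in u) ** 0.5
    sv = sum((y - mv) ** 2 for y in v) ** 0.5
    return cov / (su * sv)

sigma      = (1, 2, 3)
sigma_prim = (10, 20, 30)
alpha      = (10, 20, 30)
```

Both scores equal 1, even though \(\sigma ^{\prime }\) matches \(\alpha\) exactly and \(\sigma\) does not: EOI holds and, as the proof shows it must, VOM fails.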
- (h) First, we prove that, in general, two equivalent assignments are strictly equivalence-closer than two non-equivalent assignments:
$$\begin{aligned} \sigma ^{\prime }\in [\alpha ]_{\mathtt {T}}\wedge \sigma \notin [\alpha ]_{\mathtt {T}} \Longrightarrow {\sigma ^{\prime }}\mathrel {\sqsubset ^{\alpha }_{\mathtt {T}}} {\sigma }. \end{aligned}$$

This is straightforward, given that we can find a permissible transformation function from \(\sigma ^{\prime }\) to \(\alpha\) but not from \(\sigma\) to \(\alpha\). Therefore, the transformation of \(\sigma ^{\prime }\) is strictly value-closer than any transformation of \(\sigma\).
Let us now consider the following assignments:
$$\begin{aligned} \sigma _1&=(100,10,1)\\ \sigma _2&=(1,10,100)\\ \sigma _3&=(11,21,31)\\ \sigma ^{\prime }&=(10,20,30)\\ \alpha&=(10,20,30). \end{aligned}$$

We have that:
$$\begin{aligned} \forall \mathtt {T} \left( \sigma ^{\prime }\in [\alpha ]_\mathtt {T}\right) &\qquad \mathrm{and} \\ \forall \mathtt {T}\ge \mathtt {R} \left( \sigma _3\notin [\alpha ]_\mathtt {T}\right)&\qquad \mathrm {and}\qquad \forall \mathtt {T}\le \mathtt {I} \left( \sigma _3\in [\alpha ]_\mathtt {T}\right) \\ \forall \mathtt {T}\ge \mathtt {I} \left( \sigma _2\notin [\alpha ]_\mathtt {T}\right)&\qquad \mathrm {and}\qquad \forall \mathtt {T}\le \mathtt {O} \left( \sigma _2\in [\alpha ]_\mathtt {T}\right) \\ \forall \mathtt {T}\ge \mathtt {O} \left( \sigma _1\notin [\alpha ]_\mathtt {T}\right)&\qquad \mathrm {and}\qquad \forall \mathtt {T}\le \mathtt {N} \left( \sigma _1\in [\alpha ]_\mathtt {T}\right) . \end{aligned}$$

Now, by reduction to absurdity, we can derive the following contradictions:
$$\begin{aligned}&\mathrm {EOM}_{\mathtt {R}}\Longrightarrow {\mathcal {M}}(\sigma ^{\prime },\alpha )>{\mathcal {M}}(\sigma _3,\alpha )\mathrm {~and~} \forall \mathtt {T}\le \mathtt {I} \left( \mathrm {EOI}_{\mathtt {T}}\Longrightarrow {\mathcal {M}}(\sigma ^{\prime },\alpha )={\mathcal {M}}(\sigma _3,\alpha )\right) \\&\mathrm {EOM}_{\mathtt {I}}\Longrightarrow {\mathcal {M}}(\sigma ^{\prime },\alpha )>{\mathcal {M}}(\sigma _2,\alpha ) \mathrm {~and~} \forall \mathtt {T}\le \mathtt {O} \left( \mathrm {EOI}_{\mathtt {T}}\Longrightarrow {\mathcal {M}}(\sigma ^{\prime },\alpha )={\mathcal {M}}(\sigma _2,\alpha )\right) \\&\mathrm {EOM}_{\mathtt {O}}\Longrightarrow {\mathcal {M}}(\sigma ^{\prime },\alpha )>{\mathcal {M}}(\sigma _1,\alpha ) \mathrm {~and~} \forall \mathtt {T}\le \mathtt {N} \left( \mathrm {EOI}_{\mathtt {T}}\Longrightarrow {\mathcal {M}}(\sigma ^{\prime },\alpha )={\mathcal {M}}(\sigma _1,\alpha )\right) . \end{aligned}$$
\(\square\)
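The equivalence-class memberships used in item (h) can be verified mechanically. The predicates below test, for finite assignments, whether a permissible transformation of the required kind exists; they are an illustrative sketch (the interval and ratio checks assume enough distinct, nonzero values to fit the transformation):

```python
def nominal_eq(u, v):
    """A bijection maps u onto v iff the two tuples induce the same partition."""
    n = len(u)
    return all((u[i] == u[j]) == (v[i] == v[j]) for i in range(n) for j in range(n))

def ordinal_eq(u, v):
    """A strictly increasing map exists iff u and v order positions identically."""
    n = len(u)
    return all((u[i] < u[j]) == (v[i] < v[j]) for i in range(n) for j in range(n))

def interval_eq(u, v):
    """Some f(x) = a*x + b with a > 0 maps u onto v (needs two distinct u values)."""
    j = next(k for k in range(len(u)) if u[k] != u[0])
    a = (v[j] - v[0]) / (u[j] - u[0])
    b = v[0] - a * u[0]
    return a > 0 and all(abs(a * x + b - y) < 1e-9 for x, y in zip(u, v))

def ratio_eq(u, v):
    """Some f(x) = a*x with a > 0 maps u onto v (needs a nonzero u value)."""
    j = next(k for k in range(len(u)) if u[k] != 0)
    a = v[j] / u[j]
    return a > 0 and all(abs(a * x - y) < 1e-9 for x, y in zip(u, v))

sigma1 = (100, 10, 1)
sigma2 = (1, 10, 100)
sigma3 = (11, 21, 31)
alpha  = (10, 20, 30)
```

Each \(\sigma _i\) is equivalent to \(\alpha\) exactly up to the scale type claimed in the proof: \(\sigma _1\) only at \(\mathtt {N}\), \(\sigma _2\) up to \(\mathtt {O}\), and \(\sigma _3\) up to \(\mathtt {I}\).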
Proof of Corollary 1
The proofs are simply based on the well-known property that \(\lnot (A \wedge B)\) is the same as \(A \Rightarrow \lnot B\): by applying this to Lemma 2 (c), (d), (e), (g), and (h), it is immediate to obtain Formulas (17), (19), (21), (23), and (25). Conversely, (18), (20), (22), (24), and (26) are derived respectively from (17), (19), (21), (23), and (25) by contraposition (i.e., \(A \Rightarrow \lnot B\) is the same as \(B \Rightarrow \lnot A\)). \(\square\)
Proof of Corollary 2
The corollary statement is the same as saying that satisfying both EOM and EOI at a certain scale type, or both VOM and VOI at a certain scale type, is not compatible with any other combination, with the only exception of VOM and VOI at the ratio and interval scale types. Let us prove it by analyzing all the possible combinations. First, according to Lemma 2 (g), \(\hbox {VOM}_{\mathtt {T}}\) and \(\hbox {EOI}_{\mathtt {T}'}\) are incompatible for any pair of scale types \({\mathtt {T}}\) and \({\mathtt {T}'}\).
Therefore, value- and equivalence-oriented metrics are always incompatible definitions. Then, let us prove the incompatibility of metrics within the same family:
- Value-oriented metrics at nominal and ordinal scale types. According to Lemma 2 (c), satisfying VOI implies that VOM can not be satisfied at higher scale types; conversely, satisfying VOM implies that VOI can not be satisfied at lower scale types.
- Value-oriented metrics at interval or ratio scale types. According to Lemma 2 (c), satisfying VOM at the interval or ratio scale types implies that VOI can not be satisfied at the nominal and ordinal scale types.
- Equivalence-oriented metrics. According to Lemma 2 (h), satisfying EOI implies that EOM can not be satisfied at lower scale types (equivalently, satisfying EOM implies that EOI can not be satisfied at higher scale types).
\(\square\)
Proof of Theorem 1
The GMON axiom (Axiom 1) states that:
By comparing it with the \(\hbox {VOM}_\mathtt {N}\) axiom (Formula (14)) we see that the consequent is the same, so we just need to prove that the two antecedents are equivalent, i.e., that:
By applying (6) and Lemma 1 for the nominal scale, we have that:
that is equivalent to the right hand side of (37). \(\square\)
Proof of Theorem 2
We want to prove that the EOM axiom and the Generalized Homogeneity/Completeness axiom (GHC, Axiom 2) are equivalent. As in the proof of Theorem 1, this is the same as proving that the conditions for a metric score increase in the two axioms are equivalent. The conditions for the GHC axiom are Formulas (27) and (28); the condition for the EOM axiom is the antecedent in (16).
1.1 Part A
First, we prove that the conditions for the GHC axiom imply the condition for the EOM axiom. Using (8), the condition in the EOM axiom can be rewritten as \({\sigma }\mathrel {\sqsubseteq ^{\alpha }_{\mathtt {N}}} {\sigma ^{\prime }} \wedge \lnot \left( {\sigma ^{\prime }}\mathrel {\sqsubseteq ^{\alpha }_{\mathtt {N}}} {\sigma }\right)\). In Part A.1 we prove that the GHC axiom conditions imply \(\lnot \left( {\sigma ^{\prime }}\mathrel {\sqsubseteq ^{\alpha }_{\mathtt {N}}} {\sigma }\right)\). In Part A.2 we prove that the GHC conditions imply \({\sigma }\mathrel {\sqsubseteq ^{\alpha }_{\mathtt {N}}} {\sigma ^{\prime }}\).
1.1.1 Part A.1.
First, we can state the following equivalences:
(where we have used (7), (5), and (4)).
Now, looking at Formula (28) in the GHC conditions, we can observe that it is divided into two or-ed terms. If the first term (\(\sigma ^{\prime }(d_1)=\sigma ^{\prime }(d_2)\wedge \alpha (d_1)\ne \alpha (d_2)\wedge \sigma (d_1)\ne \sigma (d_2)\)) is true, we can define an assignment \(\sigma _e\) such that
Given that \(\sigma ^{\prime }(d_1)=\sigma ^{\prime }(d_2)\), these relationships will be preserved by any equivalent assignment \(\sigma ^{\prime }_e\). Therefore, for at least one of the two documents \(d_1\) or \(d_2\), \(\sigma _e\) will be strictly closer to \(\alpha\) than \(\sigma ^{\prime }_e\).
If the second term (\(\sigma ^{\prime }(d_1)\ne \sigma ^{\prime }(d_2)\wedge \alpha (d_1)=\alpha (d_2)\wedge \sigma (d_1)=\sigma (d_2)\)) is true, we can define an assignment \(\sigma _e\in [\sigma ]\) such that:
Given that \(\sigma ^{\prime }(d_1)\ne \sigma ^{\prime }(d_2)\), again, these relationships will be preserved by any equivalent assignment \(\sigma ^{\prime }_e\). Therefore, for at least one of the two documents \(d_1\) or \(d_2\), \(\sigma _e\) will be strictly closer to \(\alpha\) than \(\sigma ^{\prime }_e\).
Therefore, in both cases, the assertion:
is true, and therefore, using (38), \(\lnot \left( {\sigma ^{\prime }}\mathrel {\sqsubseteq ^{\alpha }_{\mathtt {N}}} {\sigma }\right)\).
1.1.2 Part A.2.
Now we prove that the GHC conditions imply \({\sigma }\mathrel {\sqsubseteq ^{\alpha }_{\mathtt {N}}} {\sigma ^{\prime }}\), which according to (7) can be expressed as:
That is, we have to prove that for any assignment \(\sigma ^{\prime }_e\) in \([\sigma ^{\prime }]\) there exists an assignment \(\sigma _e\) in \([\sigma ]\) that is value-closer to \(\alpha\) than \(\sigma ^{\prime }_e\) for the scale type \(\mathtt {N}\). To do so, we proceed in four steps: we define an assignment \(\sigma _e\) which depends on the \(\alpha\), \(\sigma\), and \(\sigma ^{\prime }\) values (A.2.1); we prove that it is an assignment, i.e., a function that assigns a value to each item (A.2.2); we prove that \(\sigma _e\) belongs to \([\sigma ]\) (A.2.3); and finally we prove that \(\sigma _e\) is value-closer to \(\alpha\) than \(\sigma ^{\prime }_e\) (A.2.4).
Part A.2.1. Let us define the following assignment:
This assignment gives the correct \(\alpha\) value to the documents that are correctly assigned by \(\sigma ^{\prime }_e\), and extends this value to the other documents that share the same \(\sigma\) value, so as to preserve consistency with \(\sigma\), that is, in order to belong to \([\sigma ]\). In other terms, \(\sigma _e(d)\) takes the correct value \(\alpha (d)\) whenever \(\sigma ^{\prime }_e\) does. In addition, \(\sigma _e\) is consistent (equivalent) with \(\sigma\) at the nominal level, given that the equality relationships are preserved both in the \(\alpha\) values and in the \(\sigma (d)+K\) values. The value K is just a number large enough to avoid overlapping with the \(\alpha\) values (for example, if \(\alpha\) and \(\sigma\) assign values in \(\{0,1\}\), we can define \(K=2\)).
Part A.2.2. We have to prove that \(\sigma _e\) is an assignment. It is enough to ensure that two values are not assigned simultaneously by \(\sigma _e(d)\). For this, we prove that if there exist two documents \(d_j\) and \(d_k\) in Formula (39) that are correctly assigned by \(\sigma ^{\prime }_e\), then the corresponding values are the same one. That is, if
and
then \(\alpha (d_j)=\alpha (d_k)\). That is, we have to prove that
Here we use the GHC Axiom. According to the second part of Formula (27)
therefore:
Then:
Given that \(\sigma ^{\prime }_e\) is equivalent to \(\sigma ^{\prime }\), this implies that:
Therefore, two values are not assigned simultaneously by \(\sigma _e(d)\) and it is an assignment.
Part A.2.3. Now, we have to prove that the assignment \(\sigma _e\) belongs to the class \([\sigma ]\). We know that at the nominal scale type, two assignments \(\sigma\) and \(\sigma _e\) are equivalent if:
Suppose that \(\sigma (d_i)=\sigma (d_j)\). Then, if
then
and therefore (see (39)) \(\sigma _e(d_i)=\sigma _e(d_j)=\alpha (d_k)\). In the other cases:
On the other hand, if \(\sigma (d_i)\ne \sigma (d_j)\), then according to (39) the only case in which \(\sigma _e(d_i)=\sigma _e(d_j)\) is if
there exist two documents \(d_k\) and \(d_l\) such that
and \(\sigma _e(d_i)=\sigma _e(d_j)=\alpha (d_k)=\alpha (d_l)\). Then, in this case:
Given that \(\sigma ^{\prime }_e\in [\sigma ^{\prime }]\):
which is not compatible with the GHC conditions.
Therefore, the proposed assignment \(\sigma _e\) belongs to the class \([\sigma ]\).
Part A.2.4. Finally, we need to prove that \(\sigma _e\) is value-closer to \(\alpha\) than \(\sigma ^{\prime }_e\) for the scale type \(\mathtt {N}\):
This holds by the definition of \(\sigma _e\).
Therefore, according to Lemma 1 we can conclude that, for every document d, \(\sigma _e(d)\) is non-strictly closer to \(\alpha (d)\) than \(\sigma ^{\prime }_e(d)\):
Therefore, according to Definition 3, \(\sigma _e\) is non-strictly value-closer to \(\alpha\) than \(\sigma ^{\prime }_e\) for the nominal scale type:
Given that this is true for every assignment \(\sigma ^{\prime }_e \in [\sigma ^{\prime }]_{\mathtt {N}}\), we can conclude that \(\sigma\) is non-strictly equivalence-closer to \(\alpha\) than \(\sigma ^{\prime }\) for the nominal scale type:
So, the EOM condition is satisfied.
1.2 Part B
Now, we prove that the EOM condition implies the GHC axiom conditions. The GHC conditions comprise Formulas (27) and (28); we prove that they are satisfied in Parts B.1 and B.2, respectively.
1.2.1 Part B.1.
The EOM condition states that, for any assignment \(\sigma ^{\prime }_e\) in \([\sigma ^{\prime }]_\mathtt {N}\) we can find an assignment \(\sigma _e\) in \([\sigma ]_\mathtt {N}\) such that \({\sigma _e}\mathrel {\trianglelefteq ^{\alpha }_{\mathtt {N}}} {\sigma ^{\prime }_e}\). Let us consider the two conditions in Formula (27). Then, if
there exists an assignment \(\sigma ^{\prime }_e\) in the equivalence class such that:
Given that there must exist an assignment in the class \([\sigma ]_\mathtt {N}\) closer to \(\alpha\), the assignments in \([\sigma ]_\mathtt {N}\), including \(\sigma\) itself, must satisfy:
Therefore, the first condition in Formula (27) in the GHC axiom is satisfied:
We can apply the same reasoning for the situation
That is, there exists an assignment \(\sigma ^{\prime }_e\) in the equivalence class \([\sigma ^{\prime }]_{\mathtt {N}}\) such that:
Given that there must exist an assignment in the class \([\sigma ]_\mathtt {N}\) closer to \(\alpha\), then the assignments in \([\sigma ]_\mathtt {N}\), including \(\sigma\), must satisfy:
Therefore, the second condition in Formula (27) in the GHC axiom is also satisfied:
1.2.2 Part B.2.
The conditions in Formula (28) must also hold. Suppose, for contradiction, that they were false; then:
Therefore, since \(\sigma _e\) is an assignment belonging to the class \([\sigma ]_{\mathtt {N}}\), we can find an assignment \(\sigma ^{\prime }_e\) in \([\sigma ^{\prime }]_{\mathtt {N}}\) such that:
This assignment is:
Therefore, according to Lemma 1, we can conclude that \(\sigma ^{\prime }_e\) is closer to \(\alpha\) than \(\sigma _e\):
Given that this is true for any assignment \(\sigma _e\) in the class \([\sigma ]_{\mathtt {N}}\), we can conclude that:
which contradicts
Therefore, the EOM and GHC axioms are equivalent. \(\square\)
Proof of Theorem 3
We want to prove that EOM at the ordinal scale type is equivalent to PRI, the Priority Axiom. For this, it is enough to prove that the EOM and PRI conditions are equivalent. On the one hand, the Priority Axiom (Axiom 3) states that if two contiguous documents are swapped in concordance with the gold, then the metric score increases. In other terms, if
and \(\sigma\) is the result of swapping \(d_1\) and \(d_2\) in \(\sigma ^{\prime }\), then \({\mathcal {M}}(\sigma ,\alpha )>{\mathcal {M}}(\sigma ^{\prime },\alpha )\).
Given that \({\mathcal {M}}(\sigma ,\alpha )\) is a function which depends exclusively on the \(\sigma\) and \(\alpha\) assignment values, swapping documents in the ranking \(\sigma\) with equal value in \(\alpha\) does not affect \({\mathcal {M}}(\sigma ,\alpha )\). Therefore, by transitivity, we can say that the condition of the Priority Axiom (Formula 29) is equivalent to \(S(\sigma ,\sigma ^{\prime })\), which denotes the fact that \(\sigma\) can be obtained by swapping documents in the \(\sigma ^{\prime }\) ranking without contradicting \(\alpha\).
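As an illustration of the Priority Axiom (not part of the proof), the following Python sketch uses DCG, a standard ranking metric chosen here only as an example; the document identifiers and gain values are hypothetical. Swapping two contiguous documents in concordance with the gold increases the score:

```python
import math

def dcg(ranking, gold):
    """Discounted Cumulative Gain: graded gain discounted by log of the rank."""
    return sum(gold[d] / math.log2(r + 2) for r, d in enumerate(ranking))

# Hypothetical gold relevance assignment alpha.
gold = {"d1": 3, "d2": 1, "d3": 2}

sigma_prime = ["d2", "d1", "d3"]  # d1 and d2 are contiguous and discordant with the gold
sigma = ["d1", "d2", "d3"]        # the same ranking after the concordant swap

# The concordant swap of two contiguous documents increases the metric score.
assert dcg(sigma, gold) > dcg(sigma_prime, gold)
```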
On the other hand, the EOM condition (the antecedent in (16)) holds when \(\sigma\) is equivalence-closer than \(\sigma ^{\prime }\). That is, for every ordinal transformation of \(\sigma ^{\prime }\) there exists a transformation of \(\sigma\) strictly closer to \(\alpha\). We will denote this condition as \(C_\mathrm {EOM}\). Then, we need to prove that:
1.3 Part A
First, let us prove that:
which, by contraposition, is equivalent to:
If \(\sigma = \sigma ^{\prime }\) then (40) is true, given that neither \(S(\sigma ,\sigma ^{\prime })\) nor \(C_\mathrm {EOM}\) hold. In the other case, whenever \(\sigma \ne \sigma ^{\prime }\), if there does not exist any (correct) swapping sequence from \(\sigma ^{\prime }\) to \(\sigma\), then we can assert that there exist at least two documents which are correctly sorted in \(\sigma ^{\prime }\) but not in \(\sigma\). That is:
Under this condition, let us consider the permissible transformation function \(f_{\sigma ^{\prime }}\) for the ordinal scale type such that
Now, by contradiction, let us assume that the EOM condition \(C_\mathrm {EOM}\) holds, i.e., that there exists a monotonic function \(f_{\sigma }\) such that the transformed assignment is value-closer to \(\alpha\) for every d. Therefore:
Since \(f_{\sigma }\) is a monotonic function, we have:
But this contradicts \(\alpha (d_1)>\alpha (d_2)\). Therefore, we cannot find a transformation \(f_{\sigma }\) of \(\sigma\) such that all values are closer to \(\alpha\) than in the \(f_{\sigma ^{\prime }}\) transformation of \(\sigma ^{\prime }\). Hence the EOM condition cannot be satisfied, and this proves (40).
1.4 Part B
Now, let us prove that the conditions for PRI (Formula (29)) imply satisfying the condition of the EOM axiom. The priority axiom states that if the rankings \(\sigma\) and \(\sigma ^{\prime }\) are equivalent except in the case of \(d_1\) and \(d_2\) where:
then \(\sigma\) must achieve a higher score than \(\sigma ^{\prime }\).
Let \(\sigma ^{\prime }_e\) be any assignment in the equivalence class \([\sigma ^{\prime }]\). Then, we consider an assignment \(\sigma _e\) such that:
where k is a number small enough such that:
The intuition is that \(\sigma _e\) moves \(d_2\) down in the rank as much as needed to correctly (w.r.t. \(\alpha\)) swap it with \(d_1\), without affecting the other relationships. Then, we can assert that \(\sigma _e\) belongs to the equivalence class \([\sigma ]\), given that: (i) for every pair of documents other than \(d_1\) and \(d_2\), \(\sigma\), \(\sigma ^{\prime }\), \(\sigma _e\) and \(\sigma _e'\) keep the same ordinal relationships; (ii) the relationship between \(d_1\) and \(d_2\) is the same in \(\sigma\) (according to Formula (41)) and in \(\sigma _e\) (according to Formula (42)); and (iii) given that \(d_1\) and \(d_2\) are contiguous in \(\sigma ^{\prime }\), \(\sigma _e'\), \(\sigma\) and \(\sigma _e\), the relationships with the rest of the documents are the same in all assignments.
Now, let us consider an assignment \(\alpha ^{\prime }\) in \([\alpha ]\) such that \(\alpha ^{\prime }(d_1)=\sigma ^{\prime }_e(d_1)=\sigma _e(d_1)\). Then we can assert that \(\sigma _e\) is value-closer to \(\alpha ^{\prime }\) than \(\sigma _e'\) (i.e., \({\sigma _e}\mathrel {\trianglelefteq ^{\alpha ^{\prime }}_{\mathtt {T}}} {\sigma ^{\prime }_e}\)): for every \(d \ne d_2\), \(\sigma _e\) is closer given that
and regarding \(d_2\), being k small enough and knowing that \(\alpha ^{\prime }(d_1)>\alpha '(d_2)\), we have that
That is, for any transformation \(\sigma ^{\prime }_e\) of \(\sigma ^{\prime }\) we can find a transformation \(\sigma _e\) of \(\sigma\) which is value-closer to \(\alpha ^{\prime }\). Therefore:
As the last step of the proof, we need to prove that equivalence-closeness is invariant under equivalent assignments. That is:
To prove this, it is enough to consider that equivalence-closeness at ordinal scale type is invariant across ordinal transformations of \(\sigma _1\) or \(\sigma _2\):
and that value-closeness at ordinal scale type is invariant when applying the same transformation to the three assignments:
Therefore, being \(f_\alpha\) the transformation such that \(f_\alpha (\alpha ^{\prime })=\alpha\):
In our case, from (43) we can thus derive that:
Finally, for a transformation \(\sigma _e\) in \([\sigma ]\) such that \(\sigma _e(d_1)=\alpha (d_1)\) and \(\sigma _e(d_2)=\alpha (d_2)\), we cannot find a transformation of \(\sigma _e'\) in \([\sigma ^{\prime }]\) closer to \(\alpha\) than \(\sigma _e\), due to \(\sigma ^{\prime }(d_1)<\sigma ^{\prime }(d_2)\). Therefore, we can assert that the Priority Axiom conditions imply:
That is, the EOM conditions for the ordinal scale type. \(\square\)
Proof of Theorem 4
The proof is straightforward. For the nominal scale type, applying a permissible transformation function (bijective function) is equivalent to changing the names of classes. Given that the contingency matrix is based on equalities and inequalities between system output and gold, modifying the class names for both the system output and the gold does not affect the matrix. \(\square\)
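The invariance argument can be checked numerically. In this sketch (the class labels and the `accuracy` helper are ours, used only for illustration), applying the same bijective renaming of class names to both the system output and the gold leaves a contingency-based metric such as Accuracy unchanged:

```python
def accuracy(system, gold):
    """Fraction of items whose predicted class equals the gold class."""
    return sum(s == g for s, g in zip(system, gold)) / len(gold)

system = ["a", "b", "a", "c", "b"]
gold   = ["a", "b", "b", "c", "c"]

# A bijective renaming of class labels (nominal permissible transformation).
rename = {"a": "x", "b": "y", "c": "z"}
system_renamed = [rename[s] for s in system]
gold_renamed   = [rename[g] for g in gold]

# The contingency matrix, and hence Accuracy, is unaffected by the renaming.
assert accuracy(system_renamed, gold_renamed) == accuracy(system, gold)
```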
Proof of Theorem 5
It is enough to prove that these metrics satisfy the Generalized Monotonicity Axiom (GMON), which is equivalent to VOM at nominal scale according to Theorem 1. GMON states that:
For Accuracy and Macro-Average Accuracy the proof is straightforward. Accuracy directly counts how many documents are classified correctly; therefore, the first assignment \(\sigma\) will always achieve a higher Accuracy than \(\sigma ^{\prime }\) under the previous conditions. Macro-Average Accuracy consists of averaging Accuracy scores across classes, that is, across values in the gold. It can be expressed as:
Under the GMON conditions, the assignment \(\sigma\) achieves the same Accuracy as \(\sigma ^{\prime }\) over all values except one, on which its Accuracy is higher.
The Phi Correlation is a metric that can be applied only over two classes, that is, two values in both assignments. In terms of the contingency matrix, it is expressed as (TP, TN, FP, FN, stand for True Positive, True Negative, False Positive, and False Negative, respectively):
Under the GMON conditions, there are two possibilities: either TP increases at the cost of FN, or TN increases at the cost of FP. In the first case:
Now, if we differentiate the function:
with respect to x, we obtain a derivative which is positive for every value of x. We can apply exactly the same procedure to the second case. Therefore, the function is monotonically increasing; that is, the Phi value increases under the GMON conditions. \(\square\)
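The monotonicity of Phi under the GMON conditions can also be checked numerically. In the following sketch (the contingency-matrix values are hypothetical), one error is corrected in each of the two cases, and the Phi value increases:

```python
import math

def phi(tp, tn, fp, fn):
    """Phi correlation computed from the binary contingency matrix."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

base = phi(tp=10, tn=20, fp=5, fn=5)

# Case 1: one false negative becomes a true positive.
assert phi(tp=11, tn=20, fp=5, fn=4) > base
# Case 2: one false positive becomes a true negative.
assert phi(tp=10, tn=21, fp=4, fn=5) > base
```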
Proof of Theorem 6
The proof is straightforward. It is enough to observe that the information given by a partition is invariant across permissible transformation functions for the nominal scale type, i.e., the bijective functions (applied to one partition or both). Therefore, any function that takes a partition as input is invariant, and hence satisfies EOI. \(\square\)
Proof of Theorem 7
We will prove that Entropy satisfies GHC (Axiom 2), and therefore, according to Theorem 2, it also satisfies EOM at the nominal scale type. We can state that the GHC conditions (see Formulas (27) and (28)) concern two clustering outputs \(\sigma\) and \(\sigma ^{\prime }\) such that the only differences are either: (i) \(\sigma\) splits clusters of \(\sigma ^{\prime }\) into clusters containing items from different classes in the gold \(\alpha\), or (ii) \(\sigma\) joins clusters of \(\sigma ^{\prime }\) containing items in the same class in \(\alpha\). In other terms, in case (ii), if \(\sigma ^{\prime }\) breaks the relationship between items from the same class in \(\alpha\), then \(\sigma ^{\prime }\) is producing a pair error; in case (i), if \(\sigma ^{\prime }\) joins clusters containing items from different classes in \(\alpha\), \(\sigma ^{\prime }\) is producing incorrect relationships. In both cases the metric value for \(\sigma ^{\prime }\) must be lower than \(\sigma\).
Knowing this, let us now prove that joining clusters containing items from the same class, or splitting clusters containing items from different classes, improves the metric score (by decreasing Class Entropy or Entropy, respectively). Let us focus on the joining case (ii) first. Let \(c_1, c_2 \in {\mathcal {V}}(\sigma ^{\prime })\) be the two clusters containing elements from the same class \(l \in {\mathcal {V}}(\alpha )\) in the gold, which are joined in \(\sigma\). The Entropy of both clusters (i.e., for \(\sigma ^{\prime }\)) is:
(we slightly abuse the notation introduced in Sect. 7.2 by replacing assignments with clusters), and Entropy after the join (i.e., for \(\sigma\)) is zero as well:
(where \(c= c_1\cup c_2\)). That is, joining these clusters does not affect Entropy. However, Class Entropy for the label l is (before the joining process, i.e., for \(\sigma ^{\prime }\)):
Now let us remark that, for any \(a,b\in (0..1)\) and \(a+b<1\), we have
and, given that a and b are positive:
Therefore:
By exploiting this property in (45) we obtain:
which corresponds to the entropy for the class l after the joining process (i.e., for \(\sigma\)). Therefore, Class Entropy decreases when joining pure clusters: since the evaluation score is inversely correlated with the entropy values, this proves that the metric value for \(\sigma ^{\prime }\) is lower than that for \(\sigma\).
Now let us consider case (i), in which one cluster in \(\sigma ^{\prime }\) containing items from different classes in the gold is split into two clusters \(c_1,c_2 \in {\mathcal {V}}(\sigma )\). Then:
But Entropy is averaged across clusters, weighted by cluster size. Therefore, given that \(E(c_1,\alpha )\) and \(E(c_2,\alpha )\) are positive numbers \(\left( -P(x)\log (P(x))\in (0..+\infty )\right)\):
we can say that Entropy decreases when splitting clusters with items belonging to different classes (from \(\sigma ^{\prime }\) to \(\sigma\)). In addition, Class Entropy is not affected by this splitting, given that the distribution of classes across clusters is not affected.
Therefore, remembering that Entropy and Class Entropy are inversely correlated with the evaluation metric, the metric satisfies GHC, and thus EOM at the nominal scale type. \(\square\)
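As an illustrative sanity check (not part of the formal proof), the following Python sketch computes a simple per-class entropy; the `class_entropy` helper and the example data are ours. It verifies that joining two pure clusters decreases the entropy of the class distribution, as shown for case (ii):

```python
import math

def class_entropy(clusters, gold, label):
    """Entropy of the distribution of class `label` across clusters."""
    docs = [i for i, g in enumerate(gold) if g == label]
    counts = {}
    for i in docs:
        counts[clusters[i]] = counts.get(clusters[i], 0) + 1
    total = len(docs)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

gold   = ["l", "l", "l", "l", "m", "m"]
split  = [1, 1, 2, 2, 3, 3]   # class l split over two pure clusters (1 and 2)
joined = [1, 1, 1, 1, 3, 3]   # the two pure clusters merged into one

# Joining the pure clusters decreases Class Entropy for label l.
assert class_entropy(joined, gold, "l") < class_entropy(split, gold, "l")
```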
Proof of Theorem 8
The proof is straightforward. In our framework, system outputs are value assignments, that is, document relevance scores estimated by systems. Therefore, by definition, the resulting ranking is invariant under ordinal transformation functions of these scores. In other words, applying any permissible transformation function (strictly monotonic function) to the system output does not affect the resulting ranking. That is, evaluation metrics for ranking are invariant across equivalent assignments for the ordinal scale type. \(\square\)
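This invariance is easy to verify numerically. In the sketch below (the `ranking` helper and the score values are ours, used only for illustration), applying a strictly monotonic transformation to the system's scores leaves the induced ranking unchanged:

```python
def ranking(scores):
    """Document indices sorted by descending score: all a ranking metric sees."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

scores = [0.9, 0.1, 0.5, 0.3]

# Any strictly monotonic (ordinal permissible) transformation of the scores...
transformed = [10 * s ** 3 + 2 for s in scores]

# ...induces exactly the same ranking.
assert ranking(transformed) == ranking(scores)
```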
Proof of Theorem 9
For this proof, it is enough to note that Amigó et al. (2013) proved that these metrics satisfy PRI, which is equivalent to EOM at the ordinal scale type according to Theorem 3. \(\square\)
Proof of Theorem 10
We need to prove that the Mean Absolute Error with a reference difference is a value-oriented metric for the interval scale type. VOI is satisfied, given that the metric is invariant across linear transformations of \(\alpha\) and \(\sigma\) values, as is immediately seen by considering affine transformations \(X^{\prime } = aX+b\) and observing that:
VOM is satisfied, given that by applying (6) and Lemma 1 for the interval scale type, we have that
and this is equivalent to
Therefore:
\(\square\)
Proof of Theorem 11
VOI is satisfied, given that the metric is invariant across linear transformations of \(\alpha\) and \(\sigma\) values, as is immediately seen by considering linear transformations of the form \(X^{\prime } = aX\) and observing that:
The proof for VOM is analogous to that of Theorem 10:
\(\square\)
Proof of Theorem 12
We want to prove that the Pearson coefficient is an equivalence-oriented metric for the interval scale type, i.e., it satisfies \(\hbox {EOI}_\mathtt {I}\) and \(\hbox {EOM}_\mathtt {I}\). A first remark is that invariance under interval permissible transformation functions (i.e., linear functions \(y=a\cdot x+b\)) is a well-known property of the Pearson linear correlation coefficient. Therefore, \(\hbox {EOI}_\mathtt {I}\) is satisfied.
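This invariance can be verified numerically. In the following sketch (the assignments are hypothetical), applying an interval permissible transformation \(y = a\cdot x + b\) with \(a>0\) to the system assignment leaves the Pearson correlation with the gold unchanged, up to floating-point error:

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

alpha = [1.0, 2.0, 3.0, 4.0, 5.0]   # gold assignment
sigma = [2.0, 1.0, 4.0, 3.0, 5.0]   # system assignment

# Interval permissible transformation y = a*x + b with a > 0.
sigma_t = [3.0 * s + 7.0 for s in sigma]

assert abs(pearson(sigma_t, alpha) - pearson(sigma, alpha)) < 1e-9
```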
Then, we need to prove that Pearson satisfies EOM at the interval scale type. That is, the Pearson correlation with a gold \(\alpha\) increases when going from an assignment \(\sigma _1\) to an assignment \(\sigma _2\) that is equivalence-closer to \(\alpha\):
We prove it by contraposition; therefore, we need to prove:
i.e., that if \({{\,\mathrm{CORR}\,}}(\sigma _2,\alpha )\le {{\,\mathrm{CORR}\,}}(\sigma _1,\alpha )\) then \(\sigma _2\) is not equivalence-closer to \(\alpha\) than \(\sigma _1\).
We start by considering the normalization \(\overline{\sigma }\) of an assignment \(\sigma\)
(where \({{\,\mathrm{\text {Avg}}\,}}(\sigma )\) represents the average value and we use \({{\,\mathrm{Dev}\,}}(\sigma )\) for the standard deviation, in order to avoid confusion with the notation of \(\sigma\) as an assignment) that has the following properties:
and
Given that Pearson is invariant under interval permissible transformations, it holds that:
Then, according to (50)
Now, let us consider any linear transformation \(\sigma ^{\prime }_2\) of the normalized assignment \(\overline{\sigma }_2\). That is: \(\sigma ^{\prime }_2(i)=a\cdot \overline{\sigma }_2(i)+b\) with \(a>0\). Then:
According to (49):
Then, according to (51) and given that \(a>0\):
Finally, applying (48) twice to replace a zero term with another zero term:
That is, reading the equality/inequality chain, we can assert that for every value \(a>0\) and b:
Now, we will prove that this result is not compatible with \({\sigma _2}\mathrel {\sqsubset ^{\alpha }_{\mathtt {I}}} {\sigma _1}\), thus proving (47).
Among all the permissible transformation functions for the interval scale, let \(f^*(\omega ) = a^* \cdot \omega + b^*\) be the transformation that minimizes \(\sum _{i\in {\mathcal {D}}} (f^*(\overline{\sigma }_2(i))-\overline{\alpha }(i))^2\). Let us write \(\omega ^* = f^*(\omega )\). Then, according to (52):
Therefore, the Euclidean distance to \(\overline{\alpha }\) of the transformation of \(\overline{\sigma }_1\) (i.e., \(f^*(\overline{\sigma }_1)\)) is smaller than the minimal Euclidean distance to \(\overline{\alpha }\) of \(\overline{\sigma }_2\) transformations. Now, given that two equivalent assignments belong to the same equivalence class (\(\omega ~\approx _\mathtt {T}~\omega ^{\prime }\Longrightarrow [\omega ]=[\omega ^{\prime }]\)) and they have the same set of permissible transformations, and given that \(\overline{\sigma }_1~\approx _\mathtt {I}~\sigma _1\) and \(\overline{\sigma }_2~\approx _\mathtt {I}~\sigma _2\), then we can say that the Euclidean distance to \(\overline{\alpha }\) of the transformation of \(\sigma _1\) (i.e., \(f^*(\sigma _1)\)) is lower than the minimal Euclidean distance to \(\overline{\alpha }\) of \(\sigma _2\) transformations:
Now, we have to prove that if an assignment \(\omega _2\) is value-closer to \(\alpha\) than another assignment \(\omega _1\), then the Euclidean distance must be lower:
According to (6):
According to Lemma 1, the consequent is equivalent to:
In addition, (54) can also be expressed as:
Therefore, considering (53) we have:
In other words, there does not exist any transformation of \(\sigma _2\) strictly value-closer to \(\overline{\alpha }\) than \(f^*(\overline{\sigma }_1)\). Therefore, \(\sigma _2\) is not equivalence-closer to \(\overline{\alpha }\) than \(\sigma _1\), that is \(\lnot ({\sigma _2}\mathrel {\sqsubset ^{\overline{\alpha }}_{\mathtt {I}}} {\sigma _1})\).
As the last step of the proof, we have to consider that
which was already proved in the proof of Theorem 3, see (44) (indeed that proof is for non-strict closeness, but nothing changes for the strict version).
Therefore, we have proved (47) and thus (46). \(\square\)
Proof of Theorem 13
We want to prove that the cosine is an equivalence-oriented metric for the ratio scale type, i.e., it satisfies \(\hbox {EOI}_\mathtt {R}\) and \(\hbox {EOM}_\mathtt {R}\). The proof has the same structure as the previous proof of Theorem 12, but we repeat everything to make it self-contained. A first remark is that invariance under ratio permissible transformation functions (i.e., scalings \(y=a\cdot x\)) is a well-known property of the cosine. Therefore, \(\hbox {EOI}_\mathtt {R}\) is satisfied.
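As with Pearson, this invariance can be checked numerically. In the sketch below (the assignments are hypothetical), a ratio permissible transformation \(y = a\cdot x\) with \(a>0\) leaves the cosine with the gold unchanged, up to floating-point error:

```python
import math

def cosine(x, y):
    """Cosine similarity between two assignments viewed as vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

alpha = [1.0, 2.0, 3.0]   # gold assignment
sigma = [2.0, 2.0, 5.0]   # system assignment

# Ratio permissible transformation y = a*x with a > 0.
sigma_t = [4.2 * s for s in sigma]

assert abs(cosine(sigma_t, alpha) - cosine(sigma, alpha)) < 1e-9
```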
Then, we need to prove that the cosine satisfies EOM at the ratio scale type. That is, the cosine with a gold \(\alpha\) increases when going from an assignment \(\sigma _1\) to an assignment \(\sigma _2\) that is equivalence-closer to \(\alpha\):
We prove it by contraposition; therefore, we need to prove:
i.e., that if \({{\,\mathrm{\text {COS}}\,}}(\sigma _2,\alpha )\le {{\,\mathrm{\text {COS}}\,}}(\sigma _1,\alpha )\) then \(\sigma _2\) is not equivalence-closer to \(\alpha\) than \(\sigma _1\).
We start by considering the normalization \(\overline{\sigma }\) of an assignment \(\sigma\) as the ratio transformation:
This normalization has the following properties:
and
Given that cosine is invariant under ratio permissible transformations, it holds that:
Then, according to (59)
Now, let us consider any ratio transformation \(\sigma ^{\prime }_2\) of the normalized assignment \(\overline{\sigma }_2\). That is: \(\sigma ^{\prime }_2(i)=a\cdot \overline{\sigma }_2(i)\) with \(a>0\). Then:
According to (58):
Then, according to (60) and given that \(a>0\):
That is, reading the equality/inequality chain, we can assert that for every value \(a>0\):
Now, we will prove that this result is not compatible with \({\sigma _2}\mathrel {\sqsubset ^{\alpha }_{\mathtt {R}}} {\sigma _1}\), thus proving (56).
Among all the permissible transformation functions for the ratio scale, let \(f^*(\omega ) = a^* \cdot \omega\) be the transformation that minimizes \(\sum _{i\in {\mathcal {D}}} (f^*(\overline{\sigma }_2(i))-\overline{\alpha }(i))^2\). Let us write \(\omega ^* = f^*(\omega )\). Then, according to (61):
Therefore, the Euclidean distance to \(\overline{\alpha }\) of the transformation of \(\overline{\sigma }_1\) (i.e., \(f^*(\overline{\sigma }_1)\)) is smaller than the minimal Euclidean distance to \(\overline{\alpha }\) of \(\overline{\sigma }_2\) transformations. Now, given that two equivalent assignments belong to the same equivalence class (\(\omega ~\approx _\mathtt {T}~\omega ^{\prime }\Longrightarrow [\omega ]=[\omega ^{\prime }]\)) and they have the same set of permissible transformations, and given that \(\overline{\sigma }_1~\approx _\mathtt {R}~\sigma _1\) and \(\overline{\sigma }_2~\approx _\mathtt {R}~\sigma _2\), we can say that the Euclidean distance to \(\overline{\alpha }\) of the transformation of \(\sigma _1\) (i.e., \(f^*(\sigma _1)\)) is lower than the minimal Euclidean distance to \(\overline{\alpha }\) of \(\sigma _2\) transformations:
Now, we have to prove that if an assignment \(\omega _2\) is value-closer to \(\alpha\) than another assignment \(\omega _1\), then the Euclidean distance must be lower:
According to (6):
According to Lemma 1 the consequent is equivalent to:
In addition, (63) can also be expressed as:
Therefore, considering (62) we have:
In other words, there does not exist any transformation of \(\sigma _2\) strictly value-closer to \(\overline{\alpha }\) than \(f^*(\overline{\sigma }_1)\). Therefore, \(\sigma _2\) is not equivalence-closer to \(\overline{\alpha }\) than \(\sigma _1\), that is \(\lnot ({\sigma _2}\mathrel {\sqsubset ^{\overline{\alpha }}_{\mathtt {R}}} {\sigma _1})\).
As the last step of the proof, we have to consider that
which was already proved in the proof of Theorem 3, see (44) (indeed that proof is for non-strict closeness, but nothing changes for the strict version). Therefore, we have proved (56) and thus (55). \(\square\)
Amigó, E., Mizzaro, S. On the nature of information access evaluation metrics: a unifying framework. Inf Retrieval J 23, 318–386 (2020). https://doi.org/10.1007/s10791-020-09374-0