Introduction

Author name disambiguation is an entity resolution task that groups name instances in bibliographic data into clusters, each referring to a unique author. It is crucial to research that mines authorship data because ambiguous names can lead to merged and/or split author identities and thus to flawed knowledge about research production and collaboration (Fegley and Torvik 2013; Kim and Diesner 2015, 2016; Strotmann and Zhao 2012). As digital libraries accumulate more publications and more ambiguous author names, such as East Asian names (Bornmann and Mutz 2015; Torvik and Smalheiser 2009), various methods for author name disambiguation have been proposed (Hussain and Asghar 2017; Smalheiser and Torvik 2009).

After a disambiguation method is implemented, its clustering performance is evaluated by measures such as Cluster-F, K-metric, Pairwise-F, Splitting & Lumping Error, and B-cubed. As there is no consensus on a definitive measure for evaluating author name disambiguation (Ferreira et al. 2012), one or two measures are chosen at the researcher's discretion. The selection is sometimes justified by the argument that the chosen measure is frequently used; in many studies, however, a measure is selected without any explanation.

A clustering measure should be selected with the context of each study in mind. The choice can, however, change our impression of a disambiguation method if its performance is rated highly by one measure but low or mediocre by another. Calculating diverse measures in a disambiguation study can be a nontrivial task because clustering measures take distinct evaluation approaches whose similarities and differences are not easy to compare. In addition, a straightforward implementation of a measure such as Pairwise-F can consume excessive runtime depending on data size, because the number of instance pairs to compare can grow quadratically in a worst-case scenario (Menestrina et al. 2010).

To help scholars select the clustering measures that best characterize their disambiguation results, this study shows that five commonly used measures of clustering performance in author name disambiguation can be calculated all in one with a common set of code. This integrative calculation shows intuitively how these measures are similar to and different from one another in evaluating clustering results. The proposed approach also reduces computation runtime, dramatically so for Pairwise-F. In the following sections, the usage patterns of clustering measures in author name disambiguation are reviewed. Then, the integration process is explained step by step with pseudo-code and examples.

Literature review

Table 1 lists selected author name disambiguation studies and the measures they use to evaluate clustering performance. A detailed explanation of each measure is provided in the Results section of this paper.

Table 1 Clustering performance measures in selected author name disambiguation studies

According to the table, Pairwise-F is the most popular measure, appearing in 15 out of 23 studies. This confirms that it is the most frequently used measure in entity resolution in general (Maidasani et al. 2012; Menestrina et al. 2010) as well as in author name disambiguation (Levin et al. 2012). K-metric is calculated in 8 studies, followed by B-cubed (B3; 7 studies) and Cluster-F (5 studies). Three studies use the Splitting and Lumping Error (SE & LE) measure.

In Table 1, 11 out of 23 studies rely on a single measure, while the others rely on two or three. In addition, the combinations of co-used measures vary. Figure 1 shows pairs of co-used measures in Table 1 and their co-usage frequencies. For example, Pairwise-F is paired with K-metric 7 times. Interestingly, some possible pairs have never been calculated together: B3, for example, is paired with Pairwise-F twice but never with K-metric, Cluster-F, or SE & LE.

Fig. 1 Co-usage frequency of clustering measures in Table 1

The use of Pairwise-F is often justified on the grounds that it is frequently used in entity resolution studies. Sometimes, measures are selected to follow the practice of referenced studies, or without any explanation. Although such choices should be understood in each study's unique context, they can change our impression of the performance of a disambiguation method. To illustrate this, the clustering performance of a digital library, DBLP (Ley 2009; Reitz and Hoffmann 2013), was evaluated on a labeled dataset, KISTI (Kang et al. 2011). KISTI consists of a set of ambiguous name instances filtered from DBLP and disambiguated semi-manually by researchers at the Korean Institute for Science and Technology Information. Of the 41,673 name instances in the original KISTI, a total of 41,358 were matched to DBLP (September 2017) records. Figure 2 shows DBLP's clustering performance evaluated on KISTI by the five measures.

Fig. 2 Performance of DBLP's author name disambiguation evaluated by five measures on KISTI

According to the figure, DBLP's disambiguation is highly accurate: the precision, recall, and F scores of three measures (Pairwise-F, B3, and K-metric) are all above 0.95, corroborating Kim (2018). The Cluster-F and SE & LE scores are, however, not as encouraging. In particular, Cluster-F shows that DBLP performs slightly worse in recall than in precision, which contrasts with the other three measures, according to which DBLP performs better in recall than in precision. SE & LE also reports that DBLP disambiguates names better in recall than in precision, but the recall-precision gap (|recall − precision| = 0.1794) is much more pronounced than under Pairwise-F, K-metric, and B3 (|recall − precision| = 0.0346–0.0487).

This illustrates why we need to consider various clustering measures when evaluating a disambiguation method. Depending on the choice of measures, the same clustering results can be judged encouraging or less so. As shown in Table 1, however, the selection of measures does not seem to be guided by any common practice. This does not imply that scholars need to report evaluation results calculated by all available measures, which would be undesirable for efficient communication.

Instead, it should be recognized that using diverse measures can illuminate where a disambiguation method performs well and where it needs improvement. For example, the low Cluster-F coupled with the high B3 in Fig. 2 indicates that the name instances misidentified by DBLP are few (high B3 scores) but appear across several truth clusters (low Cluster-F), because a single misidentified instance in a truth cluster makes DBLP's result for that whole cluster count as a failure. In addition, diverse measures enable scholars to compare the performance of their disambiguation methods with studies evaluated by different measures and thus to find room for improvement or to synthesize the strengths of each study.

Calculating various measures for a disambiguation study can, however, be a nontrivial task. Each measure needs to be implemented and carefully validated for accuracy. In addition, each measure can be implemented with different code snippets, which are not often shared, so scholars who want to implement a clustering measure usually need to write code from scratch. Sometimes, a measure may not be easily implementable for a large dataset. For example, calculating Pairwise-F can consume much computing time and RAM because the number of instance pairs can increase quadratically in the worst case (Menestrina et al. 2010).

To facilitate the efficient use of diverse clustering measures in author name disambiguation, this study proposes an algorithm that calculates the five commonly used measures all in one within an integrative framework. Specifically, although the five measures take different evaluation approaches, they can be calculated by a common set of code, which helps us better understand their similarities and differences. This integrative calculation is the first attempt of its kind and a novel contribution to the measurement of clustering performance in author name disambiguation. Moreover, during the integration process, B3 and K-metric are shown to produce the same precision and recall scores. Within this framework, Pairwise-F is calculated by a heuristic rather than a brute-force comparison of instance pairs, greatly reducing computation time from quadratic (at worst) to linear. This solution is motivated by Menestrina et al. (2010), in which Pairwise-F is calculated in linear time through a Slice algorithm combined with a cost function. This study combines the Slice algorithm with a heuristic to calculate Pairwise-F faster than the 'Slice algorithm + cost function' approach. In the following sections, the details of the integrative calculation are described with examples and pseudo-code.

Methods

Scholars usually evaluate clustering results in two ways: recall and precision. Here, a cluster consists of name instances that are decided to represent the same author either by a disambiguation algorithm (a predicted cluster) or by manual labeling (a truth cluster). Recall measures how many truth clusters are not compromised by merged or split name instances in predicted clusters, while precision measures how many predicted clusters correctly group name instances that belong to the same truth clusters.

Incorporating the aforesaid five measures into the same framework is possible because all of them evaluate disambiguation results by both recall and precision. What makes them different from one another is that each measure is designed to assess precision and recall at one of three levels: cluster, instance, or pair of instances, as summarized in Table 2.

Table 2 Summary of calculation level and recall-precision types per performance measure

Despite such different calculation levels, the measures can be implemented by embedding the instance- and pair-level calculations into the cluster-level calculation through a set of common steps ("skeleton code" hereafter). Algorithm 1 shows the skeleton code.

Algorithm 1 Skeleton code (pseudo-code figure)

A key idea of Algorithm 1 is that truth clusters are not compared cluster by cluster to predicted ones. Instead, each name instance (p) in a predicted cluster (Pi) is recorded in a hash table (pIndex) in which the instance p (key) is mapped to its cluster membership (= i: value) (code lines 2–7). Next, each name instance (t) in a truth cluster (Tj) is looked up in pIndex to find its index (i) among the predicted clusters (P). The count of that index (i) is then recorded in another hash table (tMap) in which an index i (key) is mapped to its frequency (value) (code lines 10–15). In other words, this code snippet counts how many name instances of a truth cluster appear together in each predicted cluster (i.e., share the same i), which corresponds to detecting the intersection of a truth cluster (Tj) and the predicted clusters (P). Note that this procedure adopts part of the Slice algorithm in Menestrina et al. (2010).
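To make this concrete, the skeleton logic can be sketched in Python as follows. This is a minimal sketch, not the paper's own pseudo-code: clusters are assumed to be given as lists of sets of instance ids, and the helper names build_index and intersections are ours.

def build_index(predicted):
    # Map each name instance to the index of its predicted cluster (pIndex)
    # and record the size of each predicted cluster (cSize).
    p_index, c_size = {}, {}
    for i, P_i in enumerate(predicted):
        c_size[i] = len(P_i)
        for p in P_i:
            p_index[p] = i
    return p_index, c_size

def intersections(T_j, p_index):
    # For one truth cluster, count how many of its instances fall into each
    # predicted cluster, i.e. the intersection sizes |T_j ∩ P_i| (tMap).
    t_map = {}
    for t in T_j:
        i = p_index[t]  # assumes every truth instance appears in some predicted cluster
        t_map[i] = t_map.get(i, 0) + 1
    return t_map

# The example from Table 3 (also used in the later tables): two predicted
# clusters and three truth clusters over eight name instances.
predicted = [{1, 2, 3}, {4, 5, 6, 7, 8}]
truth = [{1, 2, 3}, {4, 5}, {6, 7, 8}]
p_index, c_size = build_index(predicted)
print([intersections(T_j, p_index) for T_j in truth])  # [{0: 3}, {1: 2}, {1: 3}]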

Within this cluster-level calculation framework, the pair- and instance-level measures can also be calculated. To demonstrate this, each measure is explained in detail below, starting from the cluster level and moving to the pair and instance levels.

Results

Cluster level: Cluster-F

Cluster-F (cF) is a harmonic mean of cluster recall (cR) and cluster precision (cP) (Menestrina et al. 2010).

$$cR = \frac{{\left| {P \cap T} \right|}}{\left| T \right|}$$
(1)
$$cP = \frac{{\left| {P \cap T} \right|}}{\left| P \right|}$$
(2)
$$cF = \frac{2 \times cR \times cP}{(cR + cP)}$$
(3)

Here, P is the set of predicted clusters, while T is the set of truth clusters. The numerator \(\left| {P \cap T} \right|\) counts the number of predicted clusters that contain all and only the instances of a single truth cluster. Cluster recall (cR) is the ratio of this numerator to the number of all truth clusters (\(\left| T \right|\)). Cluster precision (cP) is the ratio of this numerator to the number of all predicted clusters (\(\left| P \right|\)).

Table 3 shows an example from Maidasani et al. (2012, p. 17) for calculating Cluster-F. In the first column, there are three truth clusters (T1, T2, and T3) to which eight name instances with numeric ids (1, 2, 3, …, 8) are assigned. The second column shows the predicted results: the eight instances in the first column are assigned to two clusters (P1 and P2). After instances are compared across predicted and truth clusters, only one case of \(\left| {P \cap T} \right|\) (P1 = T1) is detected. So the numerator for cR is 1, while the denominator is 3 (the number of truth clusters), resulting in cR = 1/3. The numerator for cP is also 1, but its denominator is 2 (the number of predicted clusters), resulting in cP = 1/2. Their harmonic mean is 0.4.

Table 3 An Illustration of Cluster-F calculation

The calculation of cR and cP can be implemented as follows.

Algorithm 2 Cluster-F calculation (pseudo-code figure)

In Algorithm 2, the code lines added to Algorithm 1 are highlighted in bold. As a result of running the skeleton code, the hash table tMap records every predicted cluster index i associated with the name instances in a truth cluster Tj, together with the frequency of each index. If (1) the frequency of an index i (key) in tMap equals the size of the truth cluster Tj (value = \(\left| {T_{j} } \right|\)) and (2) the size of the predicted cluster Pi is the same (cSize[key] = \(\left| {T_{j} } \right|\)), then all and only the name instances in the truth cluster appear together in the same predicted cluster. This is a case of the intersection (\(\left| {P \cap T} \right|\)) and increments cMatch, the numerator of cR and cP.
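For reference, this cluster-level counting can be sketched in Python by reusing the build_index and intersections helpers sketched after Algorithm 1. The function name cluster_f is ours, and the logic follows Eqs. 1–3 rather than reproducing the pseudo-code verbatim.

def cluster_f(predicted, truth):
    p_index, c_size = build_index(predicted)
    c_match = 0  # |P ∩ T|: predicted clusters that perfectly match a truth cluster
    for T_j in truth:
        t_map = intersections(T_j, p_index)
        for i, freq in t_map.items():
            # all of T_j's instances are in P_i, and P_i holds nothing else
            if freq == len(T_j) and c_size[i] == len(T_j):
                c_match += 1
    cR = c_match / len(truth)
    cP = c_match / len(predicted)
    cF = 2 * cR * cP / (cR + cP) if cR + cP else 0.0
    return cR, cP, cF

print(cluster_f([{1, 2, 3}, {4, 5, 6, 7, 8}], [{1, 2, 3}, {4, 5}, {6, 7, 8}]))
# (0.333..., 0.5, 0.4), as in Table 3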

Cluster level: K-metric

K-metric consists of Average Author Purity (AAP), Average Cluster Purity (ACP), and their geometric mean (K) (Santana et al. 2017).

$$AAP = \frac{1}{N}\mathop \sum \limits_{j = 1}^{\left| T \right|} \mathop \sum \limits_{i = 1}^{\left| P \right|} \frac{{n_{ij}^{2} }}{{n_{j} }}$$
(4)
$$ACP = \frac{1}{N}\mathop \sum \limits_{i = 1}^{\left| P \right|} \mathop \sum \limits_{j = 1}^{\left| T \right|} \frac{{n_{ij}^{2} }}{{n_{i} }}$$
(5)
$$K = \sqrt {ACP \times AAP}$$
(6)

Here, T and P represent the sets of truth and predicted clusters, respectively. N is the total number of name instances to be disambiguated. Throughout this paper, it is assumed that every name instance in the truth clusters is assigned to one of the predicted clusters. \(n_{ij}\) is the number of Pi name instances that also appear in Tj; ni and nj represent the numbers of name instances in Pi and Tj, respectively. AAP measures the fragmentation of truth clusters: its value is low when many instances of an author (= a truth cluster) are split into separate predicted clusters (≈ recall). In contrast, ACP measures the purity of the predicted clusters: its value decreases when predicted clusters contain name instances that belong to different authors, i.e., other truth clusters (≈ precision).

Table 4 illustrates the K-metric calculation. AAP starts by counting the number of name instances in each truth cluster that appear in each predicted cluster. For example, all instances in T1 appear together in P1; thus \(n_{11}^{2}\) = 3² (= 9) and n1 = 3. This is repeated over the other truth clusters (T2: 2²/2 and T3: 3²/3). The same procedure is applied for ACP, but this time starting from P1 being compared to each truth cluster.

Table 4 An Illustration of K-metric Calculation

Equations 4 and 5 can be re-written using set notation as follows. The order of cluster comparison (truth → predicted or predicted → truth) does not affect the outcome because the final sets of intersections (\(P_{i} \cap T_{j}\)) in AAP and ACP are the same. So the summation can be ordered so that truth clusters are compared to predicted clusters (i.e., truth → predicted) for both AAP and ACP.

$$AAP = \frac{1}{N}\mathop \sum \limits_{j = 1}^{\left| T \right|} \mathop \sum \limits_{i = 1}^{\left| P \right|} \frac{{n_{ij}^{2} }}{{n_{j} }} = \frac{1}{N}\mathop \sum \limits_{j \in T} \mathop \sum \limits_{i \in P} \frac{{\left| {P_{i} \cap T_{j} } \right|^{2} }}{{\left| {T_{j} } \right|}}$$
(7)
$$ACP = \frac{1}{N}\mathop \sum \limits_{i = 1}^{\left| P \right|} \mathop \sum \limits_{j = 1}^{\left| T \right|} \frac{{n_{ij}^{2} }}{{n_{i} }} = \frac{1}{N}\mathop \sum \limits_{j \in T} \mathop \sum \limits_{i \in P} \frac{{\left| {P_{i} \cap T_{j} } \right|^{2} }}{{\left| {P_{i} } \right|}}$$
(8)

The revised equations can be implemented by expanding Algorithm 1.

Algorithm 3 K-metric calculation (pseudo-code figure)

Algorithm 3 recycles the skeleton code; the lines added to Algorithm 1 are shown in bold. This re-use is possible because, in Eqs. 7 and 8, K-metric is re-formulated as a single procedure in which truth clusters are compared to predicted clusters for both AAP and ACP. In contrast, Eqs. 4 and 5 compare truth clusters to predicted clusters for AAP and then predicted clusters to truth clusters for ACP.

As all name instances in truth clusters are assigned to one of the predicted clusters, the value of N can be obtained by counting instances in either the truth clusters (instSum, code line 11) or the predicted clusters. In code lines 20–21, \(\left| {P_{i} \cap T_{j} } \right|^{2} /\left| {T_{j} } \right|\) in Eq. 7 and \(\left| {P_{i} \cap T_{j} } \right|^{2} /\left| {P_{i} } \right|\) in Eq. 8 are calculated and summed into aapSum and acpSum, respectively, using the hash values in tMap. In particular, \(\left| {P_{i} } \right|\) is obtained by looking up a predicted cluster index i (key) in cSize, generated in code line 7.
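Under the same assumptions, Eqs. 7 and 8 can be sketched in Python as follows, again reusing the skeleton helpers; k_metric is an illustrative name, not the paper's pseudo-code.

import math

def k_metric(predicted, truth):
    p_index, c_size = build_index(predicted)
    n = aap_sum = acp_sum = 0
    for T_j in truth:
        n += len(T_j)  # instSum: N equals the number of truth instances
        t_map = intersections(T_j, p_index)
        for i, freq in t_map.items():            # freq = |P_i ∩ T_j|
            aap_sum += freq * freq / len(T_j)    # Eq. 7 term
            acp_sum += freq * freq / c_size[i]   # Eq. 8 term
    aap, acp = aap_sum / n, acp_sum / n
    return aap, acp, math.sqrt(aap * acp)

print(k_metric([{1, 2, 3}, {4, 5, 6, 7, 8}], [{1, 2, 3}, {4, 5}, {6, 7, 8}]))
# AAP = 1.0, ACP = 0.7, K ≈ 0.837 for the Table 3 example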

Cluster level: Splitting and Lumping Error

Several studies have adopted the concept of Lumping (= merging) and Splitting Error (Kim and Diesner 2016; Lerchenmueller and Sorenson 2016; Li et al. 2014; Liu et al. 2014; Torvik and Smalheiser 2009). Splitting Error (SE) and Lumping Error (LE) are defined as follows (Li et al. 2014):

$$SE = \frac{{\mathop \sum \nolimits_{a} \left| {\left\{ {x |x \in T_{a} , x \notin P_{a} } \right\}} \right|}}{{\mathop \sum \nolimits_{a} \left| {T_{a} } \right|}}$$
(9)
$$LE = \frac{{\mathop \sum \nolimits_{a} \left| {\left\{ {x |x \in P_{a} , x \notin T_{a} } \right\}} \right|}}{{\mathop \sum \nolimits_{a} \left| {P_{a} } \right|}}$$
(10)

Here, x denotes an author name instance. Ta represents the truth cluster of a unique author a, while Pa denotes the predicted cluster that contains the largest number of name instances of that author. SE evaluates how many name instances of a unique author (= a truth cluster) fail to appear in the predicted cluster that contains the largest number of name instances associated with that author (≈ recall). LE measures how many name instances in that predicted cluster belong to other distinct authors, i.e., other truth clusters (≈ precision). Note that SE and LE consider only the predicted cluster that contains the largest number of name instances of a truth cluster. In contrast, Cluster-F considers only the perfect match of all name instances between a predicted cluster and a truth cluster. The others (K-metric, Pairwise-F, and B3) consider all intersection sets of instances between a truth cluster and the predicted clusters.

Table 5 illustrates how to calculate SE and LE. The SE calculation starts by comparing the name instances in T1 with P1 and P2. P1 contains the largest number of T1 name instances. As there is no name instance in T1 that does not belong to P1, the value of \(\left| {\left\{ {x |x \in T_{a} , x \notin P_{a} } \right\}} \right|\) in Eq. 9 is zero. Likewise, no splitting error is detected for T2 and T3 because all name instances in T2 and T3 are found in P2, the predicted cluster that contains all name instances of both T2 and T3. Thus, the numerator for SE is 0, while its denominator, the sum of all truth cluster sizes, is 8. For LE, the name instances in T1 are all found in P1, but the name instances of T2 and T3 are lumped together in the same predicted cluster P2. For T2, three name instances from T3 are wrongly assigned to P2 (lumping error = 3), while for T3, two instances from T2 are wrongly assigned to P2 (lumping error = 2). As both T2 and T3 share the same largest predicted cluster, P2, their \(\left| {P_{a} } \right|\) value is 5 (= |P2|).

Table 5 An Illustration of Splitting and Lumping Errors Calculation

A key difference between SE & LE and the other four measures is that SE & LE counts errors (split or lumped name instances), while the others count correctly predicted name instances. For comparison across the five measures, these error-based measures can be converted into recall (eR), precision (eP), and F (eF) measures as follows (Lerchenmueller and Sorenson 2016; Liu et al. 2014; Torvik and Smalheiser 2009):

$$eR = 1 - SE$$
(11)
$$eP = 1 - LE$$
(12)
$$eF = \frac{2 \times eR \times eP}{eR + eP}$$
(13)

This conversion scales eR between 0 (all split) and 1 (no splitting), and eP between 0 (all lumped) and 1 (no lumping). In Table 5, for example, eR = 1 − SE = 1 − 0 = 1 and eP = 1 − LE = 1 − 0.3846 = 0.6154. Their harmonic mean (= 0.7619) is eF.

Equations 9 and 10 can be re-written using set notation as follows.

$$SE = \frac{{\mathop \sum \nolimits_{a} \left| {\left\{ {x |x \in T_{a} , x \notin P_{a} } \right\}} \right|}}{{\mathop \sum \nolimits_{a} \left| {T_{a} } \right|}} = \frac{{\mathop \sum \nolimits_{a} (\left| {T_{a} } \right| - \left| {T_{a} \cap P_{a} } \right|)}}{{\mathop \sum \nolimits_{a} \left| {T_{a} } \right|}}$$
(14)
$$LE = \frac{{\mathop \sum \nolimits_{a} \left| {\left\{ {x |x \in P_{a} , x \notin T_{a} } \right\}} \right|}}{{\mathop \sum \nolimits_{a} \left| {P_{a} } \right|}} = \frac{{\mathop \sum \nolimits_{a} (\left| {P_{a} } \right| - \left| {T_{a} \cap P_{a} } \right|)}}{{\mathop \sum \nolimits_{a} \left| {P_{a} } \right|}}$$
(15)

The calculation of SE and LE can be implemented by adding lines to the skeleton code as follows.

Algorithm 4 Splitting and Lumping Error calculation (pseudo-code figure)

In Algorithm 4, code lines 17 and 19–22 find the predicted cluster index i (key) with the largest frequency (value) in tMap. For an author a (= a truth cluster Ta), this maxValue in tMap serves as \(\left| {T_{a} \cap P_{a} } \right|\) in Eqs. 14 and 15. In addition, the key of the maxValue is used to obtain cSize[maxKey] \(= \left| {P_{a} } \right|\), the size of the predicted cluster that contains the largest number of name instances of the truth cluster Ta.
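The same pattern covers SE and LE (Eqs. 14, 15) and their converted scores (Eqs. 11–13). The following Python sketch, with our own function name, reuses the skeleton helpers.

def splitting_lumping(predicted, truth):
    p_index, c_size = build_index(predicted)
    se_num = se_den = le_num = le_den = 0
    for T_a in truth:
        t_map = intersections(T_a, p_index)
        max_key = max(t_map, key=t_map.get)  # predicted cluster with the largest share of T_a
        max_value = t_map[max_key]           # |T_a ∩ P_a|
        se_num += len(T_a) - max_value       # split instances (Eq. 14 numerator)
        se_den += len(T_a)
        le_num += c_size[max_key] - max_value  # lumped instances (Eq. 15 numerator)
        le_den += c_size[max_key]
    se, le = se_num / se_den, le_num / le_den
    eR, eP = 1 - se, 1 - le                  # Eqs. 11 and 12
    eF = 2 * eR * eP / (eR + eP) if eR + eP else 0.0
    return se, le, eR, eP, eF

print(splitting_lumping([{1, 2, 3}, {4, 5, 6, 7, 8}], [{1, 2, 3}, {4, 5}, {6, 7, 8}]))
# SE = 0.0, LE ≈ 0.3846, eR = 1.0, eP ≈ 0.6154, eF ≈ 0.7619, as in Table 5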

Pairwise level: Pairwise-F

Pairwise-F measures clustering performance at the pair level via pairwise Precision (pP), pairwise Recall (pR), and pairwise F (pF), defined as follows (Menestrina et al. 2010):

$$pR = \frac{{\left| {pairs\left( P \right) \cap pairs\left( T \right)} \right|}}{{\left| {pairs\left( T \right)} \right|}}$$
(16)
$$pP = \frac{{\left| {pairs\left( P \right) \cap pairs\left( T \right)} \right|}}{{\left| {pairs\left( P \right)} \right|}}$$
(17)
$$pF = \frac{2 \times pR \times pP}{pR + pP}$$
(18)

Here, pairs(P) and pairs(T) denote the name instance pairs generated within the same cluster in the predicted clusters P and the truth clusters T, respectively. The numerator \(\left| {pairs\left( P \right) \cap pairs\left( T \right)} \right|\) is the number of instance pairs that appear in both P and T.

The calculation of pR and pP is illustrated in Table 6. Here, a pair of name instances is represented by two instance ids separated by a vertical bar. In T1, for example, three name instances (1, 2, and 3) are paired into three pairs (1|2, 1|3, and 2|3). The list of name pairs of truth clusters is compared with that of predicted clusters to generate a list of pairs found in both lists. The count of these intersection pairs constitutes the numerator (1|2, 1|3, 2|3, 4|5, 6|7, 6|8, 7|8; 7 pairs), which is divided by the total of pairs in truth clusters (= 7) for pR and by the total of pairs in predicted clusters (= 13) for pP.

Table 6 An Illustration of Pairwise-F Calculation

Calculating pR and pP can be memory- and time-consuming because the number of pairs in a cluster increases quadratically with the number of name instances (Levin et al. 2012; Louppe et al. 2016). For example, the number of pairs for a cluster with 10 instances is 45, while that for a cluster with 1000 instances is 499,500. To overcome this problem, the Pairwise-F measures can be re-written as follows.

$$pR = \frac{{\left| {pairs\left( P \right) \cap pairs\left( T \right)} \right|}}{{\left| {pairs\left( T \right)} \right|}} = \frac{{\mathop \sum \nolimits_{j \in T} \mathop \sum \nolimits_{i \in P} \left| {T_{j} \cap P_{i} } \right| \times \left( {\left| {T_{j} \cap P_{i} } \right| - 1} \right)/2}}{{\mathop \sum \nolimits_{j \in T} \left| {T_{j} } \right| \times \left( {\left| {T_{j} } \right| - 1} \right)/2}}$$
(19)
$$pP = \frac{{\left| {pairs\left( P \right) \cap pairs\left( T \right)} \right|}}{{\left| {pairs\left( P \right)} \right|}} = \frac{{\mathop \sum \nolimits_{j \in T} \mathop \sum \nolimits_{i \in P} \left| {T_{j} \cap P_{i} } \right| \times \left( {\left| {T_{j} \cap P_{i} } \right| - 1} \right)/2}}{{\mathop \sum \nolimits_{i \in P} \left| {P_{i} } \right| \times \left( {\left| {P_{i} } \right| - 1} \right)/2}}$$
(20)

Here, the number of pairs in a cluster is counted not by generating all possible pairs in the cluster but by the heuristic that the number of pairs can be computed from the number of instances via cluster size × (cluster size − 1)/2. Likewise, the number of pairs in an intersection can be obtained from the number of instances in it. Algorithm 5 implements this heuristic.

Algorithm 5 Pairwise-F calculation (pseudo-code figure)

Again, this implementation of pR and pP is based on the same skeleton code as Cluster-F, K-metric, and SE & LE. The code lines added to Algorithm 1 are highlighted in bold.
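A Python sketch of this pair-counting heuristic (Eqs. 19, 20), reusing the skeleton helpers, might look as follows; pairwise_f is our illustrative name.

def pairwise_f(predicted, truth):
    p_index, c_size = build_index(predicted)
    pairs = lambda m: m * (m - 1) // 2  # pairs in a cluster of m instances
    pair_match = truth_pairs = 0
    for T_j in truth:
        truth_pairs += pairs(len(T_j))
        t_map = intersections(T_j, p_index)
        for freq in t_map.values():     # freq = |T_j ∩ P_i|
            pair_match += pairs(freq)
    pred_pairs = sum(pairs(size) for size in c_size.values())
    pR = pair_match / truth_pairs
    pP = pair_match / pred_pairs
    pF = 2 * pR * pP / (pR + pP) if pR + pP else 0.0
    return pR, pP, pF

print(pairwise_f([{1, 2, 3}, {4, 5, 6, 7, 8}], [{1, 2, 3}, {4, 5}, {6, 7, 8}]))
# pR = 1.0, pP = 7/13 ≈ 0.538, pF = 0.7, as in Table 6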

Instance level: B-cubed

B-cubed measures clustering performance at the instance level. Its three parts, B3 Recall (bR), B3 Precision (bP), and B3 F (bF), are defined as follows (Levin et al. 2012):

$$bR = \frac{1}{N}\mathop \sum \limits_{t \in T} \frac{{\left| {P\left( t \right) \cap T\left( t \right)} \right|}}{{\left| {T\left( t \right)} \right|}}$$
(21)
$$bP = \frac{1}{N}\mathop \sum \limits_{t \in T} \frac{{\left| {P\left( t \right) \cap T\left( t \right)} \right|}}{{\left| {P\left( t \right)} \right|}}$$
(22)
$$bF = \frac{2 \times bR \times bP}{bR + bP}$$
(23)

Here, t is a name instance in truth clusters T. N is the total of instances in truth clusters (T). \(T\left( t \right)\) means a truth cluster that contains a name instance t, while \(P\left( t \right)\) means a predicted cluster that contains the name instance t.

Table 7 illustrates the B3 calculation. Starting with instance 1 in T1 for bR, for example, the predicted cluster containing it is identified: \(P\left( 1 \right) = P_{1}\) and \(T\left( 1 \right) = T_{1}\). Next, the intersection of the truth cluster (T1) and the predicted cluster (P1) is obtained (instances 1, 2, and 3). Then, \(\left| {P_{1} \cap T_{1} } \right|/\left| {T_{1} } \right|\) = 3/3 is computed. This is repeated for instances 2 and 3 in T1, resulting in an array of (3/3, 3/3, 3/3) for T1. After the same procedure is applied to T2 and T3, the sum of \(\left| {P\left( t \right) \cap T\left( t \right)} \right|/\left| {T\left( t \right)} \right|\) over all name instances is divided by the total number of those instances (= 8), producing bR = 1.0.

Table 7 An Illustration of B3 F Calculation

Although B3 is an instance-level metric, its calculation can be formulated at the cluster level. This is possible because, in Eqs. 21 and 22, the calculation results for the name instances in the same intersection are identical. In Table 7, for example, instances 4 and 5 in T2 have the same outcome (= 2/2) because they appear together in the intersection of T2 and P2. So we can re-write (2/2 + 2/2) as (2/2) × 2 = 2²/2. Here, 2/2 is the outcome for a single instance, while the multiplier 2 (yielding the squared numerator 2²) is the number of instances in the intersection (|T2 ∩ P2|). Drawing on this formulation, Eqs. 21 and 22 can be re-written as follows.

$$\begin{aligned} bR & = \frac{1}{N}\mathop \sum \limits_{t \in T} \frac{{\left| {P\left( t \right) \cap T\left( t \right)} \right|}}{{\left| {T\left( t \right)} \right|}} = \frac{1}{N}\mathop \sum \limits_{j \in T} \mathop \sum \limits_{{t \in T_{j} }} \frac{{\left| {P\left( t \right) \cap T_{j} } \right|}}{{\left| {T_{j} } \right|}} = \frac{1}{N}\mathop \sum \limits_{j \in T} \mathop \sum \limits_{{t \in T_{j} }} \mathop \sum \limits_{i \in P} \frac{{\left| {P_{i} \cap T_{j} } \right|}}{{\left| {T_{j} } \right|}} \\ & = \frac{1}{N}\mathop \sum \limits_{j \in T} \mathop \sum \limits_{i \in P} \frac{{\left| {P_{i} \cap T_{j} } \right|}}{{\left| {T_{j} } \right|}} \times \left| {P_{i} \cap T_{j} } \right| = \frac{1}{N}\mathop \sum \limits_{j \in T} \mathop \sum \limits_{i \in P} \frac{{\left| {P_{i} \cap T_{j} } \right|^{2} }}{{\left| {T_{j} } \right|}} = AAP \\ \end{aligned}$$
(24)
$$\begin{aligned} bP & = \frac{1}{N}\mathop \sum \limits_{t \in T} \frac{{\left| {P\left( t \right) \cap T\left( t \right)} \right|}}{{\left| {P\left( t \right)} \right|}} = \frac{1}{N}\mathop \sum \limits_{j \in T} \mathop \sum \limits_{{t \in T_{j} }} \frac{{\left| {P\left( t \right) \cap T_{j} } \right|}}{{\left| {P\left( t \right)} \right|}} = \frac{1}{N}\mathop \sum \limits_{j \in T} \mathop \sum \limits_{{t \in T_{j} }} \mathop \sum \limits_{i \in P} \frac{{\left| {P_{i} \cap T_{j} } \right|}}{{\left| {P_{i} } \right|}} \\ & = \frac{1}{N}\mathop \sum \limits_{j \in T} \mathop \sum \limits_{i \in P} \frac{{\left| {P_{i} \cap T_{j} } \right|}}{{\left| {P_{i} } \right|}} \times \left| {P_{i} \cap T_{j} } \right| = \frac{1}{N}\mathop \sum \limits_{j \in T} \mathop \sum \limits_{i \in P} \frac{{\left| {P_{i} \cap T_{j} } \right|^{2} }}{{\left| {P_{i} } \right|}} = ACP \\ \end{aligned}$$
(25)

In Eq. 24, a cluster Tj is first set as the calculation unit (\(\mathop \sum \nolimits_{j \in T} \mathop \sum \nolimits_{{t \in T_{j} }} \left( \right)\)). This follows from the transformation of T(t) into Tj, because all name instances in Tj have the same set of elements (themselves) and thus the same value of \(\left| {T\left( t \right)} \right|\) \(( = \left| {T_{j} } \right|\)). Next, an instance t needs to be checked cluster by cluster to decide where it appears in the predicted clusters Pi(t), as in \(\mathop \sum \nolimits_{j \in T} \mathop \sum \nolimits_{{t \in T_{j} }} \mathop \sum \nolimits_{i \in P} \left| {P_{i} \left( t \right) \cap T_{j} } \right|/\left| {T_{j} } \right|\). Evidently, Pi(t) is the same as Pi. Finally, the calculation can be simplified to \(\mathop \sum \nolimits_{j \in T} \mathop \sum \nolimits_{i \in P} \left| {P_{i} \cap T_{j} } \right|/\left| {T_{j} } \right| \times \left| {P_{i} \cap T_{j} } \right|\), because the results of \(\left| {P_{i} \cap T_{j} } \right|/\left| {T_{j} } \right|\) are identical for all name instances that appear in the same intersection (\(P_{i} \cap T_{j}\)). That is why \(\left| {P_{i} \cap T_{j} } \right|/\left| {T_{j} } \right|\) is multiplied by the number of instances belonging to the intersection (\(\left| {P_{i} \cap T_{j} } \right|\)), omitting the instance-referencing part of the nested summation (\(\mathop \sum \nolimits_{{t \in T_{j} }} \left( \right)\)). The final re-writing is the same as the calculation of AAP in Eq. 7. Likewise, bP can be re-written to match ACP (Eq. 25). This transformation is illustrated by the example in Table 8, where the B3 and K-metric calculations are juxtaposed to show their similarity.

Table 8 An Illustration of B3 F Calculation in comparison with K-metric Calculation

Thus, Eqs. 24 and 25 indicate that bR and bP can be calculated by Algorithm 3, which computes AAP and ACP. The only difference is that B3 F is the harmonic mean of AAP (= bR) and ACP (= bP), while K is the geometric mean of AAP and ACP.
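Because bR and bP coincide with AAP and ACP, a B-cubed sketch only needs to swap the geometric mean for a harmonic one; the snippet below reuses the k_metric sketch given earlier.

def b_cubed(predicted, truth):
    aap, acp, _ = k_metric(predicted, truth)  # bR = AAP, bP = ACP (Eqs. 24, 25)
    bR, bP = aap, acp
    bF = 2 * bR * bP / (bR + bP) if bR + bP else 0.0
    return bR, bP, bF

print(b_cubed([{1, 2, 3}, {4, 5, 6, 7, 8}], [{1, 2, 3}, {4, 5}, {6, 7, 8}]))
# bR = 1.0, bP = 0.7, bF ≈ 0.824 for the Table 3 example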

All-in-one calculation and runtime test

In Algorithms 2–5, the five clustering measures are calculated using the same skeleton code from Algorithm 1. This commonality enables us to integrate them into a single set of code, as in Algorithm 6 below. Note that B3 precision and recall are not calculated separately because they produce the same results as ACP and AAP in K-metric.

Algorithm 6 All-in-one calculation (pseudo-code figure)
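While Algorithm 6 itself is given as a pseudo-code figure, a self-contained Python sketch of such an all-in-one pass (with our own variable names, not the paper's) might look as follows: one loop over the truth clusters feeds all five measures.

import math

def all_in_one(predicted, truth):
    # Skeleton: index predicted clusters by instance and record cluster sizes.
    p_index, c_size = {}, {}
    for i, P_i in enumerate(predicted):
        c_size[i] = len(P_i)
        for p in P_i:
            p_index[p] = i

    pairs = lambda m: m * (m - 1) // 2
    n = c_match = 0
    aap_sum = acp_sum = 0.0
    se_num = se_den = le_num = le_den = 0
    pair_match = truth_pairs = 0

    for T_j in truth:
        # Skeleton: intersection sizes |T_j ∩ P_i| for this truth cluster.
        t_map = {}
        for t in T_j:
            i = p_index[t]
            t_map[i] = t_map.get(i, 0) + 1

        n += len(T_j)
        truth_pairs += pairs(len(T_j))
        max_key = max(t_map, key=t_map.get)           # SE & LE: largest intersection
        se_num += len(T_j) - t_map[max_key]
        se_den += len(T_j)
        le_num += c_size[max_key] - t_map[max_key]
        le_den += c_size[max_key]
        for i, freq in t_map.items():
            if freq == len(T_j) and c_size[i] == len(T_j):
                c_match += 1                          # Cluster-F: perfect match
            aap_sum += freq * freq / len(T_j)         # AAP (= bR) term
            acp_sum += freq * freq / c_size[i]        # ACP (= bP) term
            pair_match += pairs(freq)                 # Pairwise-F numerator

    pred_pairs = sum(pairs(s) for s in c_size.values())
    f = lambda r, p: 2 * r * p / (r + p) if r + p else 0.0
    cR, cP = c_match / len(truth), c_match / len(predicted)
    aap, acp = aap_sum / n, acp_sum / n
    eR, eP = 1 - se_num / se_den, 1 - le_num / le_den
    pR, pP = pair_match / truth_pairs, pair_match / pred_pairs
    return {
        "Cluster-F": (cR, cP, f(cR, cP)),
        "K-metric": (aap, acp, math.sqrt(aap * acp)),
        "B3": (aap, acp, f(aap, acp)),
        "SE&LE": (eR, eP, f(eR, eP)),
        "Pairwise-F": (pR, pP, f(pR, pP)),
    }

# Prints a dict of (recall, precision, F) per measure for the Table 3 example.
print(all_in_one([{1, 2, 3}, {4, 5, 6, 7, 8}], [{1, 2, 3}, {4, 5}, {6, 7, 8}]))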

Besides integrating multiple measures into a single framework, Algorithm 6 reduces computation runtime. To illustrate this, the 41,358 name instances in KISTI were used again to evaluate the clustering performance of DBLP's disambiguation by the five measures, as in Fig. 2. For this test, the steps implied by the original equations of each measure were implemented straightforwardly. For example, for Pairwise-F, instance pairs per cluster were generated (797,297 truth pairs and 826,187 predicted pairs) and compared one by one to find their intersection. The execution time of each measure was recorded in seconds and compared to that of the same measure implemented by the corresponding Algorithm 2–5. Table 9 reports the runtime results.

Table 9 Runtime (in seconds) of measure Calculation by straightforward implementation versus proposed algorithms

Table 9 shows that Algorithms 2–5 calculated each measure in less than 0.057 s, while the straightforward implementations took from approximately 47 s (Cluster-F) up to 23,433 s (6.5 h, Pairwise-F). All measures could be calculated in less than 0.065 s by the all-in-one algorithm.

To test the scalability of Algorithm 6, a set of 1.2 M name instances associated with unique identifiers in a high-energy physics publication library, INSPIRE, was obtained (Louppe et al. 2016). Using the INSPIRE unique identifiers as the ground truth of author identity, the performance of all-initials-based name disambiguation was evaluated by the five measures. This task is challenging, especially for the calculation of Pairwise-F, because the number of instance pairs in the truth clusters (= 15,388 authors) approximates 213.4 M, while that in the predicted clusters (= 18,672) approaches 194.5 M (intersection pairs ≈ 179.9 M). Algorithm 6 produced evaluation results for all measures in 1.583 s. Tested for the Pairwise-F calculation alone, Algorithm 5 ran in 1.552 s, which compares favorably with the 12.903 s of the Generalized Merge Distance (GMD) algorithm (Menestrina et al. 2010), the most runtime-efficient method for calculating Pairwise-F so far.

Conclusion and discussion

This paper demonstrated that five measures of clustering performance in author name disambiguation can be integrated into one calculation framework. This was possible mainly because name instances in truth and predicted clusters were compared not by a brute-force cluster-by-cluster comparison but by means of two hash tables: one recording instances with their predicted cluster indices, and one recording the frequencies of those indices within the truth-predicted cluster intersections. Using set notation, each measure's equations were reformulated to fit the integrative framework.

A few contributions of this paper are worth noting. First, as there is no standard collection of code for the five clustering measures above, this paper can provide an anchoring place for scholars to implement them and validate their correctness using efficient code and samples. Second, the proposed integration of measures dramatically reduces runtime compared to straightforward implementations because it uses hash tables instead of brute-force cluster-by-cluster and instance-by-instance comparisons, whose runtime can grow up to O(n²). In particular, the Pairwise-F calculation was re-formulated using a heuristic that counts pairs in a cluster, enabling fast calculation. The scalability of the integrative calculation can help scholars evaluate the clustering performance of a disambiguation method at a large scale, for example, on the several million name instances associated with Researcher IDs in Web of Science (Backes 2018). This paper demonstrated this potential by evaluating the clustering results of 1.2 M name instances in INSPIRE.

Another contribution is the demonstration that the K-metric and B3 measures produce the same recall and precision scores. This means that studies using either K-metric or B3 have evaluated their clustering results in almost the same way and are thus comparable to one another. This can also be good news for scholars who use K-metric, because B3 has been argued to evaluate clustering results better than other measures on challenging cases (Amigó et al. 2009). In addition, the combined usage frequency of these two measures in Table 1 equals that of Pairwise-F (= 15), which makes them a family of major clustering measures in author name disambiguation.

Most importantly, the integrative calculation shows that the five measures can be understood within a single framework in terms of their similarities and differences. This can help us modify current measures or propose new ones that assess disambiguation performance from distinctive perspectives. In addition, the integrative framework can incorporate other clustering measures, such as Closest-Cluster-F (Menestrina et al. 2010) and Variation of Information (Meilă 2003), which have rarely been used in author name disambiguation. Such integration will not only guide us in selecting the measures that best characterize disambiguation performance but also support future efforts to compare different evaluation approaches under diverse ambiguity conditions for entity resolution in general, beyond author name disambiguation.