1 Introduction

There is a consensus that author name disambiguation (AND) has been one of the hardest problems faced by digital libraries since their early days. This is demonstrated by the large volume of literature published on the topic in the last decade (e.g., a recent survey cites literally dozens of works [7]) and by the continuous interest of the research community in the problem. Although some efforts do exist to provide a globally unique identifier to all authors, this will not work in all cases, for instance, while processing textual citations in papers for bibliographic analysis. Thus, automatic solutions that are highly effective, efficient and practical in most situations are still needed.

Most automated solutions in the literature either exploit some problem-specific heuristics to define similarity functions used by clustering or author assignment solutions, or exploit supervised methods that learn such functions [7]. Historically, supervised solutions have empirically outperformed the heuristic-based ones, with the burden of having to rely on manually labeled training sets for the learning process. Such training sets are usually very expensive and cumbersome to obtain. Furthermore, such supervised solutions may not be practical at all in real-world situations, in which new ambiguous authors (not present in a “static” training set) appear all the time and changes in the publication patterns of known authors are common.

Moreover, most supervised solutions just apply some type of generic machine learning solution and do not exploit specific knowledge about the problem that is usually embedded in the heuristic-based solutions. The few exceptions we know of [25] extend generic supervised solutions to consider a few aspects inherent to the problem, thus presenting the best effectiveness results reported in the literature.

In this article, we follow a similar reasoning, but in the opposite direction. Instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions based on our experience of almost a decade working on the problem. We use the proposed heuristics and similarity functions to find the nearest author, represented by a cluster of citations (thus the name of our method: nearest cluster), and assign the ambiguous citation to that author. If no cluster is “near enough”, the method assumes that a new author is being inserted into the digital library. There are also heuristics to incorporate reliable predictions into the disambiguation process and to merge clusters of similar authors. As our solution requires some parameters to be defined, supervision is used mostly to optimize such parameters for each particular dataset. Notice, however, that our solution can be run without such a supervised step, as it does not learn any particular model from the training set; the model is already encoded in the heuristics and similarity functions we propose.

Our experiments with several collections, using only the minimum amount of information present in bibliographic citations, namely, author names and publication and venue titles, demonstrate that our proposed method has a number of desired properties. First of all, it is highly effective—in the experiments our method has produced significant effectiveness gains against the best state-of-the-art (supervised) method. Second, it is highly efficient, having a low computational complexity. In fact, when compared to this aforementioned method, our solution is orders of magnitude faster (about 6,500 times). Compared with another fast “natural” baseline, our method is just as fast but up to 58 % more effective. And third, it is highly practical—besides using only the minimum amount of information available in a citation (and thus not relying on other types of information usually hard to obtain or simply not available at all, such as author affiliations or emails), we show that our method is very insensitive to several parameters and very easy to configure using some general rules.

In addition, the proposed solution can run without any supervision, still producing very reasonable results with default parameters, or by exploiting a previously proposed technique for automatically building training sets for AND tasks [6, 9]. Moreover, our method can run in an incremental way, disambiguating only new citations inserted into the digital library (differently from clustering solutions that always disambiguate the digital library as a whole), and is able to automatically incorporate new information into the process (e.g., the existence of new authors not previously seen in the training data).

We also perform a thorough analysis of the error cases of our method, i.e., cases in which citations were assigned to the wrong authors. This analysis shows that more than half of the errors are due to the lack of information in the training data or to the very high ambiguity of the references, showing that improvements are very difficult to obtain and that we are close to the limits of the effectiveness that can be achieved with this type of solution.

Finally, we provide a qualitative analysis of our solution considering all baselines used for comparison in our experiments and show that our proposed method possesses most of the qualities a good AND solution should have, mainly when compared to the alternative approaches.

This article is organized as follows. Section 2 covers related work. Section 3 describes our proposed method, including details about how its parameters are estimated. Section 4 details our experimental evaluation. Finally, Sect. 5 concludes the paper, including perspectives for future work.

2 Related work

According to Ferreira et al. [7], AND methods may be broadly classified into two categories according to the type of approach they adopt: author grouping methods and author assignment methods.

2.1 Author grouping methods

Author grouping methods exploit the similarity among the citations in order to group them by using some clustering technique. In this category, we may cite several methods [2, 4, 5, 12, 14, 19, 23, 24, 26].

Bhattacharya and Getoor [2] propose a combined similarity function based on terms present in a citation and relational information between disambiguated coauthor names of the citations. A greedy agglomerative algorithm uses such a function to group the most similar citations into clusters.

Cota et al. [4] propose a heuristic-based hierarchical clustering method (HHC) that is based on two assumptions: (1) very rarely do two citations with similar author names that share similar coauthor names refer to two different authors, and (2) authors usually publish on the same subjects or venues during their careers. HHC has two steps. The first step creates clusters of citations using the coauthor names, and the second step successively fuses similar clusters based on the publication and venue titles. Each cluster contains aggregated information of all its citations for each attribute, providing more information for the next round of fusion.

In [5], Fan et al. propose GHOST (GrapHical framewOrk for name diSambiguaTion), a framework that represents a collection of citations as a graph in which each vertex represents a citation and each undirected edge represents a coauthorship between two citations. Fan et al. also propose a new similarity function based on the formula that calculates the resistance of a parallel circuit and use the Affinity Propagation clustering algorithm to group the citations of the same author.

Han et al. [12] propose the use of K-way spectral clustering with QR decomposition to obtain a given number of citation clusters, where each cluster is associated with an author. To calculate the similarity among the citations, Han et al. apply the cosine similarity function to the references.

Huang et al. [14] exploit DBSCAN, a density-based clustering algorithm, for clustering references by author. The similarity function used by DBSCAN is learned by an active support vector machine algorithm (LaSVM), representing the comparison between two citations by a similarity vector, where each feature represents the comparison of an attribute of the two citations.

In [26], Wu et al. propose a hierarchical agglomerative clustering algorithm based on Dempster-Shafer theory in combination with Shannon's entropy to disambiguate author names. Dempster-Shafer theory fuses evidence to obtain more reliable candidate clusters for fusion, and Shannon's entropy uncovers the importance of each feature.

Some works propose disambiguating author names in PubMed citations [19, 23, 24]. Torvik et al. [23] propose to learn a probabilistic metric for determining the similarity among PubMed citations while, in [24], Treeratpituk and Giles use a Random Forest classifier to learn the similarity function between citations. Torvik et al. [23] also propose a heuristic for automatically generating training examples and a new agglomerative clustering algorithm for grouping citations of the same author. In [19], Liu et al. present a system for disambiguating author names in PubMed that automatically obtains training examples based on low-frequency author names and pairs of citations in different ambiguous groups, besides using a Huber classifier to learn weight functions jointly with an agglomerative clustering technique to group the citations of the same author.

2.2 Author assignment methods

Author assignment methods directly assign the citations to their corresponding authors using either supervised classification [6, 8, 10, 25] or model-based clustering techniques [11, 22]. These methods either use a training set or work interactively to obtain models that predict the author of a citation. For example, in [10], Han et al. propose two supervised methods based on the naïve Bayes and Support Vector Machine learning techniques. Both methods learn a disambiguation function from a set of training examples to predict the author of each citation. Aiming to eliminate the need for training examples, in [11], Han et al. present an unsupervised hierarchical version of the naïve Bayes-based method that models each author by estimating the parameters with the Expectation Maximization algorithm.

In [22], Tang et al. present a probabilistic framework based on Hidden Markov Random Fields for the polysemy subproblem. In that work, the authors use evidence based on content (i.e., terms of the citation attributes) and on relationships between citations (e.g., coauthor names in common) to disambiguate author names. The authors also use the Bayesian Information Criterion to estimate the number of authors in a collection.

In [25], Veloso et al. propose SLAND, a method that infers the author of a citation by using a supervised rule-based associative classifier. The method is also capable of improving the coverage of the training set by means of reliable predictions and of detecting authors without any citation in the training set. Aiming to reduce the number of examples needed to compose the training set, Ferreira et al. [8] present SAND, a new active sampling strategy based on association rules for the disambiguation task. Then, in [6, 9], they extend their previous work to eliminate the need for manually provided training examples.

2.3 Final remarks

Besides classifying the methods according to the type of approach they use, we can alternatively group them according to the evidence exploited in the AND task: citation attributes only [4–6, 10, 24, 25], Web information [15, 16, 20], or implicit data that can be extracted from the available attributes [21]. Some methods [23] also assume the availability of additional information such as emails, affiliations, addresses, or paper headers, which is not always available or easy to obtain, although, if existent, it may help the process considerably.

Our method, which can be considered an author assignment one, follows a totally different path when compared to what has been historically done in AND tasks. Here, instead of simply applying a generic machine learning solution or adapting it to the problem, we use our accumulated experience on the problem to propose a set of domain-specific heuristics to solve the AND task, using supervision (i.e., training data) only to adapt the method to specific idiosyncrasies of the target dataset. Notice that our proposed method can work without any supervision by using a set of default parameters. We also take advantage of a strategy previously proposed in the literature [6, 9] for automatically building training sets for AND tasks with no manual supervision. Our experimental results show that our method combines aspects of effectiveness, efficiency and practicability rarely found in any other method proposed in the literature.

3 Proposed method

In this section, we present our proposed AND method and the procedures used to estimate its parameters. In our method, citations are represented as sets of terms occurring in the list of coauthors and in the publication and venue titles. From the list of coauthors, we obtain terms formed by the initial letter of each coauthor's first name appended to the coauthor's last name. Terms in the publication and venue titles are obtained after the removal of stop words and stemming of the remaining words using Porter's algorithm [1]. For the equations defined next, we consider the notation presented in Table 1.

Table 1 Notation table
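To make this representation concrete, the sketch below builds the three term sets for a single citation record. It is a minimal sketch under stated assumptions: the dict-based citation layout and the small stop-word list are illustrative only, and NLTK's PorterStemmer is used as one possible implementation of Porter's algorithm.

```python
# Illustrative term extraction (assumptions: dict-based citation records, a toy
# stop-word list, and NLTK's PorterStemmer standing in for Porter's algorithm).
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "of", "on", "for", "and", "in", "to", "with"}
stemmer = PorterStemmer()

def coauthor_terms(coauthors):
    # "John Smith" -> "j_smith": first-name initial appended to the last name
    terms = set()
    for name in coauthors:
        parts = name.lower().split()
        if parts:
            terms.add(parts[0][0] + "_" + parts[-1])
    return terms

def title_terms(text):
    # remove stop words and stem the remaining words
    words = [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]
    return {stemmer.stem(w) for w in words}

def citation_terms(citation):
    # citation: {"coauthors": [...], "title": "...", "venue": "..."}
    return {"c": coauthor_terms(citation["coauthors"]),
            "t": title_terms(citation["title"]),
            "v": title_terms(citation["venue"])}
```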

3.1 Nearest cluster model

An author's publication profile may be characterized by the distribution of the terms found in her bibliographic citations. The list of coauthors captures her collaboration network, while the set of terms in the venue and publication titles captures her research interests. Thus, each term present in a citation captures some evidence of the association between a citation and a reference in this citation. The strength of such evidence varies according to the attribute to which it belongs and to its discriminative capability. For instance, the presence of coauthors in common in two citations is a stronger evidence that these two citations belong to the same author than the occurrence of a common term in the titles. Such differences in the importance of the attributes have been exploited by several disambiguation methods (e.g., [18]).

Based on such observations, a simple disambiguation method consists in defining a similarity function between authors (represented as groups/clusters of citations) and an ambiguous citation. Disambiguation of a citation \(c_k\) is performed by identifying the author whose group has the highest similarity with \(c_k\). Equation 1 presents the similarity function between a citation \(c_k\) and an author \(a_j\) used by the proposed method. Such a function consists of a weighted sum of the similarities produced by considering each attribute in isolation.

$$\begin{aligned} \mathrm{sim}(c_k, a_j) = w_c\, f(c_k^c, a_j) + w_t\, f(c_k^t, a_j) + w_v\, f(c_k^v, a_j) \end{aligned}$$
(1)

It should be noticed that a related idea is presented in [18] to identify false citations not belonging to an author due to homonyms. In that work, the similarities between the vectors of terms in the attributes and the set of citations of an author are computed using the traditional cosine similarity. We, on the other hand, propose our own similarity functions, specific to the problem (see next). In any case, we use the above-mentioned work as one of our baselines.

In this work, our proposed method is applied in a generic way for solving the disambiguation task, using our own similarity function defined as

$$\begin{aligned} f(c_k^x,a_j) = \frac{1}{|c_k^x|} \sum _{t_i \in c_k^x} w(t_i, a_j) \end{aligned}$$
(2)

where,

$$\begin{aligned} w(t_i, a_j) = {\left\{ \begin{array}{ll} \left( 1 + \frac{1-n_i}{n} \right) \left( \frac{\mathrm{freq}_{i,j}^2+1}{|a_j|\,\mathrm{freq}_i+2} \right)^\alpha &{}\quad \text {if } n_i > 0, \\ 0 &{}\quad \text {otherwise.} \end{array}\right. } \end{aligned}$$

Function \(w(t_i, a_j)\) weights the terms of each citation, returning a value between 0 and 1, defined according to the following problem-specific heuristics:

  • The higher the number of authors who have used a term \(t_i\), the lower its discriminative power. The factor \((1 + (1-n_i)/n)\) returns 1 if the term is used in only one group, or 1/n if the term has been used in all groups.

  • Given the occurrence of a term \(t_i\), the value of the conditional probability \(P(a_j|t_i)\) gives us evidence of the strength of the association between the citation that possesses term \(t_i\) and author \(a_j\). Such a probability may be estimated by the fraction \(\mathrm{freq}_{i,j}/\mathrm{freq}_i\); however, such an estimate may be biased due to the high imbalance commonly found in training data for disambiguation tasks. It is well known that publication patterns have a very skewed distribution: few authors publish a lot while most authors have much smaller figures. To reduce this possible bias, we multiply this estimate by the distribution of the term in the group, estimated by the fraction \(\mathrm{freq}_{i,j}/|a_j|\).

  • Considering that in disambiguation tasks the training data are usually not very large (due to the inherent costs of generating them), in order to produce accurate estimates of the probabilities and term distributions we added constants (1 and 2, see Eq. 2) and, mainly, the \(\alpha \) exponent to smooth the calculation of these factors. The value of the \(\alpha \) parameter, defined in the interval [0, 1], controls the influence of the discriminative factor on the final weight of the term. The value of this parameter can be determined by cross-validation in the training set. A code sketch of this weighting scheme is given right after this list.
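To make the weighting concrete, the sketch below implements Eqs. 1 and 2 under assumed data layouts that are not part of the original method description: each author cluster is a dict with its per-term frequencies and its citation count, and a global index carries n, \(n_i\) and \(\mathrm{freq}_i\).

```python
# Minimal sketch of Eqs. 1 and 2 (assumed layouts: cluster = {"freq": {term: freq_ij},
# "size": |a_j|}; index = {"n": #author groups, "n_i": {term: #groups using it},
# "freq": {term: total frequency}}).
def term_weight(term, cluster, index, alpha):
    n_i = index["n_i"].get(term, 0)              # number of author groups using the term
    if n_i == 0:
        return 0.0
    rarity = 1.0 + (1.0 - n_i) / index["n"]      # 1 if unique to one group, ~1/n if ubiquitous
    freq_ij = cluster["freq"].get(term, 0)       # frequency of the term in this group
    freq_i = index["freq"][term]                 # frequency of the term over all groups
    assoc = (freq_ij ** 2 + 1.0) / (cluster["size"] * freq_i + 2.0)
    return rarity * assoc ** alpha

def attribute_sim(terms, cluster, index, alpha):
    # f(c_k^x, a_j): average term weight over the terms of one attribute
    if not terms:
        return 0.0
    return sum(term_weight(t, cluster, index, alpha) for t in terms) / len(terms)

def citation_sim(cit_terms, cluster, index, weights, alpha):
    # sim(c_k, a_j): weighted sum over the coauthor (c), title (t) and venue (v) attributes
    return sum(weights[x] * attribute_sim(cit_terms[x], cluster, index, alpha)
               for x in ("c", "t", "v"))
```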

3.2 Beyond similarity: domain-specific heuristics

In the course of our 10-year history working on this problem, we have found that some domain-specific issues are very important and should be incorporated by any AND solution. For instance, treatment of “new” ambiguous authors, i.e., authors not previously known to the methods, and adaptability to changes in publication patterns, are two phenomena that do occur in real-world collections and must be taken into consideration. Thus, the similarity values between a citation and groups of citations associated with authors in the training set can be used to cope with both issues.

The main steps of our proposed method are shown in Algorithm 1. During disambiguation, we keep two sets of citations: the training set D, which defines a set of authors A, and a set E of citations that have already been assigned but not (yet) included in the training data D. Such a division is performed to allow doubtful citations to be re-evaluated at the moment new citations are included in the training set. In the following, we detail the steps of the algorithm, describing the strategies adopted for the automatic inclusion of new citations and new authors in the training set.

Algorithm 1 The main steps of the proposed NC method

3.2.1 Including new citations into the training set

Adaptability concerns two aspects inherent to AND: the lack of information in the current training set and the capability of capturing the slowly changing publication patterns of existing authors (i.e., those present in the training data). We cope with both issues by proposing heuristics to enhance and incorporate new information into the training data.

In more detail, if a large portion of the terms in a citation \(c_k\) is used mainly by one specific author \(a_j\), especially regarding coauthor names, there is high confidence that \(c_k\) belongs to \(a_j\). By using Eq. 1, a simple confidence metric for such an assignment can be obtained based on the similarities between \(c_k\) and the first and second most similar authors, \(a_l\) and \(a_m\), respectively, as given by the following equation:

$$\begin{aligned} \Delta (c_k) = \mathrm{sim}(c_k,a_l) \left( \frac{2 \cdot \mathrm{sim}(c_k,a_l)}{\mathrm{sim}(c_k, a_l) + \mathrm{sim}(c_k,a_m)} - 1 \right) \end{aligned}$$
(3)

This metric combines information about how close \(c_k\) is to \(a_l\) and how distant it is from the other authors. In the SLAND method [25], a close baseline, the confidence estimate exploits only the distance between the highest conditional probability \(\hat{p}(a_l|c_k)\) and the others, independently of how confident this highest probability is, which allows the inclusion of low-confidence examples in the training data. Given a limit \(\Delta _\mathrm{min}\), the assignments with \(\Delta (c_k) > \Delta _\mathrm{min}\) can be considered confident enough and can be included in the training data. At this moment, the assigned citations not included in the training data (set E) that have similar terms are analyzed again (lines 23–31 in Algorithm 1). This approach allows the method to automatically correct occasional misclassifications, provided that the amount of noise introduced into the training data is low. In Sect. 3.3, we present the proposed strategy to define \(\Delta _\mathrm{min}\) from the training data.
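A minimal sketch of this confidence test, assuming that the per-author similarities of Eq. 1 are already available in a plain dictionary (the names and example values are illustrative only):

```python
# Eq. 3: combines how close c_k is to the nearest author and how far it is from the runner-up.
def assignment_confidence(scores):
    # scores: {author_id: sim(c_k, a_j)} for the candidate authors of citation c_k
    ranked = sorted(scores.values(), reverse=True) + [0.0, 0.0]  # pad when < 2 candidates
    sim_l, sim_m = ranked[0], ranked[1]
    if sim_l == 0.0:
        return 0.0
    return sim_l * (2.0 * sim_l / (sim_l + sim_m) - 1.0)

# Example: a citation much closer to its best-matching author than to the runner-up
# yields a high Delta and would enter the training data whenever Delta > Delta_min.
delta = assignment_confidence({"a1": 0.62, "a2": 0.15, "a3": 0.08})   # ~0.38
```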

3.2.2 Detecting new authors

A low similarity between a citation and its most similar group in the training set may indicate the presence of a new author (not in the current training data) or simply a shift in the research interests of an existing author. Thus, we introduce the following two additional thresholds to help define the probability of finding a new author and to control the amount of fragmentation inserted into the training data, i.e., distinct groups whose citations belong to the same author: (i) the minimum similarity value between a citation and a group of citations, \(\gamma \), needed to assign a citation to an author; and (ii) the minimum similarity value between two clusters of citations, \(\phi \), to consider them as belonging to the same author.

The \(\gamma \) parameter also helps us to control the purity of the groups, as it introduces a minimal requirement for performing an actual author assignment. Notice, however, that even with small values for the \(\gamma \) threshold, the citation groups of authors who work on different research lines may still suffer some fragmentation. To alleviate this problem, every time a new citation \(c_k\) is inserted into the training set, the similarities between the group assigned to \(c_k\) and all the other groups that share at least one term with \(c_k\) are re-calculated (lines 17 and 28 in Algorithm 1). If such a similarity is greater than \(\phi \), these clusters are merged. Notice that in this procedure we do not take into account manually labeled authors/groups, since we assume the authorship of these citations is correct in most cases.

For comparing two groups, any distance function may be used. Here we used the cosine similarity applied to the term vectors of all citations in the groups weighted according to Eq. 4:

$$\begin{aligned} \mathrm{tfidf}(t_i, a_j) = \left( 1 + \log _2 \mathrm{freq}_{i,j} \right) \cdot \left( 1 + \log _2 \frac{|A|}{n_i} \right) \cdot w_x \end{aligned}$$
(4)

This function combines the TFIDF (term frequency-inverse document frequency) weighting scheme with the weight \(w_x\) defined for the attribute x containing term \(t_i\), the same weight used in Eq. 1. The values of parameters \(\gamma \) and \(\phi \) may be obtained by cross-validation in the training set. Next, we detail the parameter estimation procedures.
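Before that, a minimal sketch of the cluster comparison just described; the per-attribute term frequencies, the document frequencies \(n_i\), the number of author groups |A| and the attribute weights are assumed to be given as plain dictionaries.

```python
# Clusters become tfidf-weighted term vectors (Eq. 4) and are compared with cosine;
# two clusters are merged whenever their cosine similarity exceeds phi.
import math

def cluster_vector(term_freqs, doc_freq, n_authors, attr_weights):
    # term_freqs: {attr: {term: freq_ij}}; doc_freq: {term: n_i}
    vec = {}
    for attr, freqs in term_freqs.items():
        for term, freq_ij in freqs.items():
            tfidf = (1.0 + math.log2(freq_ij)) * (1.0 + math.log2(n_authors / doc_freq[term]))
            vec[(attr, term)] = tfidf * attr_weights[attr]
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def should_merge(cluster_a, cluster_b, doc_freq, n_authors, attr_weights, phi):
    va = cluster_vector(cluster_a, doc_freq, n_authors, attr_weights)
    vb = cluster_vector(cluster_b, doc_freq, n_authors, attr_weights)
    return cosine(va, vb) > phi
```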

3.3 Estimation of parameters

Our proposed method has seven parameters: \(w_c, w_t, w_v, \alpha , \Delta _\mathrm{min}, \gamma \) and \(\phi \). The best values for these parameters may be obtained using standard cross-validation procedures in the training set. However, in order to increase the efficiency of this search, we propose a few strategies. The process is performed in five steps, in the following order: (1) definition of parameter \(\alpha \); (2) definition of the values of the weights \(w_c, w_t\) and \(w_v\); (3) definition of the value of \(\Delta _\mathrm{min}\); (4) definition of the value of \(\gamma \); and (5) definition of the value of \(\phi \).

Parameter \(\alpha \) is defined by testing the values 0.1, 0.2, 0.3, 0.4 and 0.5 using cross-validation in the training set. These candidate values, determined empirically, define the weight of each term based mainly on its discriminative capability. In this procedure, the weights \(w_c, w_t\) and \(w_v\) are all set to a default value of 1/3. In this and some other steps of the parameter search, the method neither identifies new authors nor includes new citations in the training data. The procedures for the other parameters are described next.

3.3.1 Attribute weights

For a citation \(c_k\) to be correctly associated with an author \(a_l\), the weights must be defined so that \(\mathrm{sim}(c_k, a_l) > \mathrm{sim}(c_k, a_m), \forall a_m \in A - \{a_l\}\). Based on the training data, it is possible to define a set of inequalities considering the difference of similarities between each attribute in the form \(w_c \cdot \mathrm{diff}_c + w_t \cdot \mathrm{diff}_t + w_v \cdot \mathrm{diff}_v > 0\), where \(\mathrm{diff}_x\) represents the difference \(f(c_k^x, a_l) - f(c_k^x, a_m)\).

By using, again, a cross-validation procedure in the training set, we obtain approximately \(\bar{n}|D|\) inequalities that form a system which may have no solution. However, the values of the calculated differences, i.e., \(\mathrm{diff}_c, \mathrm{diff}_t\) and \(\mathrm{diff}_v\), reflect the degree of importance of each attribute. For instance, if many authors publish in the same venues, some values of \(\mathrm{diff}_v\) will be very low, indicating that the weight \(w_v\) should not be higher than \(w_c\) or \(w_t\).

Considering such observations, the adopted strategy for defining the attribute weights consists of using the summations of the differences of the similarities between each attribute weighted by the total similarity as shown in Eq. 5:

$$\begin{aligned} w_x \approx \sum _{c_k \in D} \sum _{a_m \in A - \{a_l\}} \mathrm{diff}_x \cdot \mathrm{sim}(c_k, a_m) \end{aligned}$$
(5)

where x represents a given attribute and \(a_l\) is the correct author of the citation. The weights are defined after a normalization step in which each value is turned into its proportion of the total over all attributes. If the minimum value is negative, all values are first shifted by adding the absolute value of that minimum plus 1.
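A minimal sketch of this estimation, assuming that cross-validation has already produced, for every training citation and every incorrect author \(a_m\), the per-attribute differences \(\mathrm{diff}_x\) and the corresponding \(\mathrm{sim}(c_k, a_m)\); the data layout is illustrative, and the shift-then-normalize order is one plausible reading of the normalization step.

```python
# Eq. 5 plus normalization: accumulate diff_x * sim(c_k, a_m), shift if needed, normalize.
def estimate_attribute_weights(diff_records):
    # diff_records: iterable of (diffs, sim_wrong), with diffs = {"c": ..., "t": ..., "v": ...}
    totals = {"c": 0.0, "t": 0.0, "v": 0.0}
    for diffs, sim_wrong in diff_records:
        for x in totals:
            totals[x] += diffs[x] * sim_wrong
    lowest = min(totals.values())
    if lowest < 0:                                   # make all values positive first
        totals = {x: v + abs(lowest) + 1.0 for x, v in totals.items()}
    norm = sum(totals.values())
    return {x: v / norm for x, v in totals.items()}  # proportions sum to 1
```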

3.3.2 Minimum confidence

The \(\Delta _\mathrm{min}\) value must be defined so that, for a given citation \(c_k\) with \(\Delta (c_k) > \Delta _\mathrm{min}\), the probability of \(c_k\) being correctly assigned is close to 1. Once the \(\alpha , w_c, w_t\) and \(w_v\) parameters have been defined, a new cross-validation procedure is performed to fill in a matrix with the calculated values of \(\Delta \) and a variable indicating a correct or wrong assignment. These values are sorted and an accuracy rate is calculated for each line of the matrix. The lowest value of \(\Delta \) associated with an accuracy of at least \(90\%\) is assigned to \(\Delta _\mathrm{min}\). Such an accuracy level has been empirically defined, but it can be adjusted to control the level of noise allowed in the training set during the disambiguation process.

If the amount of training is very low, it may occur that no suitable value for \(\Delta _\mathrm{min}\) can be found. In this case, we used the formula \(\Delta _\mathrm{min} = (w_x + w_y) / 2\) to define the value of this parameter, where \(w_x\) represents the lowest weight and \(w_y\) the second lowest weight. This value expresses the expected similarity value when a citation has few discriminative terms related to a group.
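A minimal sketch of this procedure, assuming the cross-validation run has produced a list of (\(\Delta \), correct?) pairs; the fallback uses the two lowest attribute weights as described above.

```python
# Pick the lowest Delta threshold whose accepted assignments are at least 90% correct.
def estimate_delta_min(delta_records, attr_weights, target_accuracy=0.90):
    # delta_records: list of (delta, correct) pairs observed in cross-validation
    for threshold, _ in sorted(delta_records):
        accepted = [ok for d, ok in delta_records if d > threshold]
        if accepted and sum(accepted) / len(accepted) >= target_accuracy:
            return threshold
    w = sorted(attr_weights.values())                # too little training data: fall back to
    return (w[0] + w[1]) / 2.0                       # (lowest + second lowest weight) / 2
```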

3.3.3 Minimum evidence

The value of the \(\gamma \) parameter is calculated by maximizing the tradeoff between the rate of correct assignments and the rate of identification of new authors. The higher the value of \(\gamma \), the higher the chance of identifying new authors, but also the higher the risk of increasing fragmentation in the final result. This tradeoff may be represented as the following sum of probabilities:

$$\begin{aligned}&\hat{p}(\mathrm{sim}(c_k, a_x) < \gamma | a_z \notin A)\hat{p}(a_z \notin A) \\&\quad +\hat{p}(\mathrm{sim}(c_k, a_x) \ge \gamma | a_z \in A)\hat{p}(a_z \in A) \quad \end{aligned}$$

where \(c_k\) represents a citation, \(a_x\) the group with highest similarity with \(c_k\) and \(a_z\) the correct group of \(c_k\). The probability that a citation belongs to an author absent in the training set \(\hat{p}(a_z \notin A)\) was approximated by the fraction \((|A| + 1)/(|D| + 2)\), although this value may be manually set in order to control the purity of the generated clusters. The conditional probabilities were estimated based on the following steps:

  1. For each citation \(c_k \in D\) and author \(a_j \in A\), we calculate the similarities \(\mathrm{sim}(c_k, a_j)\) without considering the presence of \(c_k\) in the training set; this is similar to a leave-one-out validation procedure.

  2. For each citation \(c_k \in D\), we store in a matrix the similarity value of the citation's own group, \(\mathrm{sim}(c_k, a_z)\), and the highest similarity obtained when the correct group is disregarded, \(\mathrm{sim}(c_k, a_x)\).

  3. The matrix is sorted in decreasing order of the \(\mathrm{sim}(c_k, a_x)\) values, so that the position of each value corresponds to the number of citations in the training set that would not be identified if the value of \(\gamma \) were smaller than \(\mathrm{sim}(c_k, a_x)\). These positions are used to estimate the probabilities \(\hat{p}(\mathrm{sim}(c_k, a_x) < \gamma | a_z \notin A)\).

  4. The matrix is then sorted in increasing order of \(\mathrm{sim}(c_k, a_z)\), so that the position of each value corresponds to the number of citations with authors in the training set that would not be correctly assigned if the parameter \(\gamma \) were higher than \(\mathrm{sim}(c_k, a_z)\). These positions are used to estimate the probabilities \(\hat{p}(\mathrm{sim}(c_k, a_x) \ge \gamma | a_z \in A)\).

Using the estimated probabilities, the final value of \(\gamma \) is defined as the one that maximizes the sum presented at the beginning of this section. If the amount of training data is very low, the resulting value of \(\gamma \) may also be low; thus, we empirically define a minimum value for this parameter corresponding to \(\frac{1}{4}\) of the lowest attribute weight. We analyze the sensitivity of the method to this parameter in our experimental evaluation. A sketch of this estimation procedure is given below.
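The sketch assumes the leave-one-out similarities of steps 1–2 are available as (similarity to the correct group, best similarity among the other groups) pairs, and that \(\hat{p}(a_z \notin A)\) has already been approximated as described above; it is an interpretation of the procedure, not a verbatim reproduction.

```python
# Choose gamma maximizing p(detect new author)*p_new + p(keep known author)*(1 - p_new).
def estimate_gamma(records, p_new_author, min_gamma):
    # records: list of (sim_correct, sim_best_other) pairs from leave-one-out validation
    if not records:
        return min_gamma
    n = len(records)
    candidates = sorted({s for pair in records for s in pair})
    best_gamma, best_score = min_gamma, -1.0
    for gamma in candidates:
        p_detect = sum(1 for _, other in records if other < gamma) / n     # new-author side
        p_keep = sum(1 for correct, _ in records if correct >= gamma) / n  # known-author side
        score = p_detect * p_new_author + p_keep * (1.0 - p_new_author)
        if score > best_score:
            best_gamma, best_score = gamma, score
    return max(best_gamma, min_gamma)
```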

3.3.4 Minimum cluster similarity

The \(\phi \) parameter was designed to reduce fragmentation caused by the lack of evidence when assigning a citation. To calculate the value of this parameter, the training set is split into two halves; with this, we expect to have approximately two clusters per author. The similarities between each pair of clusters belonging to the two different partitions of the training data are stored in a matrix, along with a variable indicating whether the two groups belong to the same author or not. Based on this procedure, we define the value of \(\phi \) as the lowest similarity value associated with an error of at most one incorrect cluster match. With small amounts of training data this value can be very low, therefore we define a minimum empirical value of 0.075 for it. We also analyze the sensitivity of the method to this parameter in our experimental evaluation.
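A minimal sketch of this procedure, assuming the cross-half cluster pairs have already been scored (e.g., with the tfidf cosine of Eq. 4) and flagged as same-author or not:

```python
# Phi: the lowest similarity threshold that would cause at most one wrong cluster merge.
def estimate_phi(pair_records, floor=0.075):
    # pair_records: list of (similarity, same_author) for cluster pairs across the two halves
    for threshold in sorted({s for s, _ in pair_records}):
        wrong_merges = sum(1 for s, same in pair_records if s >= threshold and not same)
        if wrong_merges <= 1:
            return max(threshold, floor)
    return floor
```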

3.3.5 Self-training

When no training data are available, our method can be manually configured (some guidelines for this are presented in Sect. 4.5.5) or we may use a strategy to automatically create the training data. Ferreira et al. [6] have proposed a procedure to create and select pure clusters in order to train the SLAND method. Here, this same strategy is used to train NC. This procedure is based on two steps: (1) extract pure clusters, i.e., clusters with a high probability of containing citations belonging to only one author, and (2) discard fragmented clusters. Pure clusters are extracted by exploiting recurring patterns in the coauthorship graph, that is, two citations are placed in the same cluster if they have coauthors in common.

In order to produce the best possible training set, which ideally contains only one cluster per author in the collection, we need to discard fragmented clusters. This step starts by sorting the clusters in descending order of size, resulting in a sorted list \({\mathcal {C}}\). We then iteratively select the largest cluster in \({\mathcal {C}}\) that is most dissimilar to the clusters already selected to compose the training data. Ferreira et al. [6] have shown that, using a specialized author name similarity function, it is possible to obtain a low fragmentation rate in this last step while keeping the purity of the clusters very high.
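The sketch below shows one plausible greedy reading of this selection loop; the actual procedure in [6] relies on a specialized author-name similarity function, so both the pairwise similarity function and the max_sim cutoff used here are placeholders, not part of the original method.

```python
# Illustrative greedy selection: walk clusters from largest to smallest and keep only
# those sufficiently dissimilar to the clusters already chosen (placeholder criterion).
def select_training_clusters(clusters, similarity, max_sim=0.5):
    # clusters: list of citation clusters; similarity: caller-supplied pairwise function
    selected = []
    for cluster in sorted(clusters, key=len, reverse=True):
        if all(similarity(cluster, chosen) < max_sim for chosen in selected):
            selected.append(cluster)
    return selected
```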

3.4 Computational complexity

Our disambiguation method can be efficiently implemented using hash tables. The analysis presented next considers that insertion and search operations in this data structure have a time complexity of O(1).

The number of instructions executed to compute the similarity function defined by Eq. 2 is proportional to the number of terms of the citation, thus the steps defined in lines 2–7 of Algorithm 1 have a time complexity of \(O(|c_k||A|)\). If the value of \(\mathrm{sim}(c_k, a_l)\) is less than \(\gamma \), a new cluster is created and included in the training set in time \(O(|c_k|)\). If the value of \(\Delta (c_k)\) is less than or equal to \(\Delta _\mathrm{min}\), the citation is included in the set E, also in time \(O(|c_k|)\). Finally, if the citation is considered reliable, the complexity of each step of the algorithm is given by:

  • \(O(|c_k|)\) for the inclusion of \(c_k\) in the training set (line 15);

  • O(|G|m) for calculating the similarities between \(a_l\) and each cluster in G and to perform the merges if necessary, where m represents the average number of terms per cluster (lines 16–22);

  • \(O(|E||c_k|)\) for getting \(E_l\) (line 23);

  • \(O(|E_l|(|A|t + gm))\) for reclassifying each citation in \(E_l\), where t represents the average number of terms per citation and g the average number of clusters related to each citation (lines 24–31).

Since \(g \le |A|\) and \(m \ge t\), the time complexity of the entire disambiguation procedure can be represented by \(O(|A|(|c_k| + |E|m))\).

4 Experimental evaluation

In this section, we present experimental results that demonstrate the effectiveness, efficiency and practicability of our method, hereafter called NC (from nearest cluster model). We first describe the baselines, collections and evaluation metrics.

4.1 Baselines

We used as baselines five author assignment methods (four supervised, one self-trained)—SLAND [25], SAND [9], SVM [10], NB [10] and Cosine—and two unsupervised author grouping methods—LASVM-DBSCAN [14] and HHC [4]. We explain them next.

SLAND infers the author of a citation by using a lazy supervised rule-based associative classifier. The method infers the most probable author of a given record \(r_i\) using the confidence of the association rules \({\mathcal {X}} \rightarrow a_i\), where \({\mathcal {X}}\) only contains features of \(r_i\). The method works on demand, i.e., the association rules used to infer the correct author of a citation record are generated at the moment the citation is presented to the method for disambiguation. The method is also capable of inserting new examples into the training data during the disambiguation process, using reliable predictions, and of detecting authors not present in the training data. SAND extends this method by using the procedure described in Sect. 3.3.5 to automatically produce training data, based on pure clusters, for the associative classifier, i.e., SAND does not need any manual labeling process, predicts the author by cluster instead of by citation, and runs without any user parameter setup. For these two baselines, we have used the original implementations provided by the authors.

SVM associates each author with an author class and trains a classifier for that class. Each citation is represented by a feature vector with the elements of its attributes (author and coauthor names, and terms from publication and venue titles) and their TFIDF values as feature weights. The method uses a “one class versus all others” approach to multi-class classification.

NB assumes that each citation is generated by a naïve Bayes model. Thus, to estimate the model parameters, some citations are used as training data. Let \(a_i\) be an author class corresponding to a unique single person, where \(i \in [1, n]\) and n is the number of authors, and let r be a citation. The probability that each class \(a_i\) generates a citation r is calculated using the naïve Bayes rule. The citation r is attributed to the class with the maximal posterior probability of producing it. The parameters of the model for a given author \(a_i\), which include, for example, the probability of \(a_i\) writing an article with previously seen and unseen coauthors and the probability of \(a_i\) writing a title or publishing in a venue containing a specific word, are all defined in [10]. We have implemented this baseline ourselves.

Cosine is a version of the method proposed in [18] that uses the cosine distance with TFIDF to calculate the similarity between citations and groups of citations. Similarly to our method, it also uses different weights for the citation attributes. Cosine requires that examples of citations for each author be given a priori in order to calculate the similarities and perform the assignments. In this sense, it can be considered a supervised method. We have also used our own implementation of this baseline.

LASVM-DBSCAN uses DBSCAN, a density-based clustering method [14], for clustering citations by author. The distance metric between pairs of citations used by DBSCAN is calculated by a trained active SVM algorithm (LASVM), which yields, according to the authors, a simpler and faster model than standard support vector machines. The authors use different functions for each attribute to build the similarity vectors, such as the edit distance for emails and URLs, Jaccard similarity for addresses and affiliations, and soft-TFIDF for author and coauthor names. In our work, we use the cosine similarity for publication and venue titles and soft-TFIDF for author and coauthor names.

HHC is a two-step heuristic-based method [4]. In the first step, it creates clusters of citations with similar author names that share at least a similar coauthor name. Author name similarity is given by an ad-hoc name comparison algorithm. This step produces very pure but fragmented clusters. In the second step, it successively merges clusters with similar author names according to the similarity between the other citation attributes (i.e., publication and venue titles) calculated using the cosine measure. In each round of merging, the information of merged clusters is aggregated providing more information for the next round. This process is successively repeated until no more merges are possible according to a similarity threshold. For this baseline, we have used the original implementation provided by the authors.

4.2 Collections

In order to evaluate our disambiguation method, we used collections of citations extracted from DBLP and BDBComp. The collections are described below.

The first collection, derived from DBLP, comprises 8,418 citations associated with 477 distinct authors, an average of approximately 17.6 citations per author. This collection includes 5,585 citations whose author names are in short format, i.e., the initial of the first name appended to the last name. The version we use is the same as that adopted by several other works [4, 6, 8, 9, 20, 25]. It was based on [11], in which the authors manually labeled all citations. For this, they used the authors' publication home pages, affiliation names, e-mails, and coauthor names in complete name format, and also sent emails to some authors to confirm their authorship. We used the 14 original ambiguous groups considered in [11], with a few corrections performed in [4] due to the identification of some labeling mistakes. Working with a cleaner collection helps to better separate the limitations of the technique from errors caused by bad labeling.

In more detail, the original DBLP collection used in [11] contained a few errors, probably with several causes. First, the citation records listed on an author's homepage were assumed to be associated with publications written by that same author. However, there are many cases in which an author's homepage contains citation records related to different authors and, as a result, there might be errors that merge citation records of different authors into the same ambiguous group. Second, there were several cases of duplicated citations. One example is the citation “The design of a rotating associative memory for relational database applications” by Chyuan Shiun Lin, Diane C. P. Smith, and John Miles Smith in the “J. Smith” group, whose duplicated entry was removed from the collection. In [4], the authors manually examined and closely analyzed the original DBLP collection and removed errors using more recent information found on the Web and in other digital libraries and similar systems (e.g., ArnetMiner and Microsoft Academic Search). In any case, the differences between the collection we use and its original version [11] correspond to only 64 records, or less than 0.8 % of the collection.

The second collection derived from DBLP, hereafter referred to as KISTI, was built by the Korea Institute of Science and Technology Information [17] for English homonym author name disambiguation. The top 1,000 most frequent author names in a late-2007 DBLP version were obtained jointly with their citations. To disambiguate this collection, the authors submitted a query composed of the surname of the author and the publication title of each citation to Google, aiming at finding personal publication pages. The first 20 web pages retrieved for each query were manually checked to identify the correct personal publication page for each citation. The identified page was then used to disambiguate the citation record. This collection has 41,659 name instances in 867 name groups, corresponding to 6,908 authors.

The collection of citations extracted from BDBComp comprises 361 citations associated with 184 distinct authors, approximately two citations per author, with only eight author names in short format. Notice that, although much smaller than the DBLP and KISTI collections, this collection is very difficult to disambiguate, because it has many authors with just a few (one or two) citations. This collection contains the 10 largest ambiguous groups found in BDBComp at the time of the dataset creation [4].

We have used the KISTI and BDBComp collections as originally created, while we have made a few corrections to the DBLP collection based on mistakes found during an error analysis procedure, as explained above.

4.3 Evaluation metrics

In order to evaluate our proposed disambiguation method, we use two evaluation metrics: the K and pairwise F1 metrics. These are standard metrics that have been largely used by the community (e.g., [14, 20]). In the discussion that follows, we describe these metrics. The key idea is to compare the clusters extracted by disambiguation methods against ideal, perfect clusters, which were manually extracted. Hereafter, a cluster extracted by a disambiguation method will be referred to as an empirical cluster, while a perfect cluster will be referred to as a theoretical cluster.

The K metric determines the trade-off between the average cluster purity (ACP) and the average author purity (AAP). Given an ambiguous group, ACP evaluates the purity of the empirical clusters with respect to the theoretical clusters for this ambiguous group. Thus, if the empirical clusters are pure (i.e., they contain only citations associated with the same author), the corresponding ACP value will be 1. ACP is defined as:

$$\begin{aligned} \text{ ACP } = \frac{1}{N}\sum _{i=1}^{e}\sum _{j=1}^{t}\frac{n_{ij}^2}{n_{i}} \end{aligned}$$
(6)

where N is the total number of citations in the ambiguous group, t is the number of theoretical clusters in the ambiguous group, e is the number of empirical clusters for this ambiguous group, \(n_i\) is the total number of citations in the empirical cluster i, and \(n_{ij}\) is the total number of citations in the empirical cluster i which are also in the theoretical cluster j.

For a given ambiguous group, AAP evaluates the fragmentation of the empirical clusters with respect to the theoretical clusters. If the empirical clusters are not fragmented, the corresponding AAP value will be 1. AAP is defined as:

$$\begin{aligned} \text{ AAP } = \frac{1}{N}\sum _{j=1}^{t}\sum _{i=1}^{e}\frac{n_{ij}^2}{n_{j}} \end{aligned}$$
(7)

where \(n_j\) is the total number of citations in the theoretical cluster j.

Thus, the K metric consists of the geometric mean between ACP and AAP values. It evaluates the purity and cohesion of the empirical clusters extracted by each method, being defined as:

$$\begin{aligned} K = \sqrt{\text{ ACP }\cdot \text{ AAP }} \end{aligned}$$
(8)

Finally, pairwise F1 \((p\hbox {F}1)\) is the F1 metric calculated using pairwise precision and pairwise recall. Pairwise precision (pP) is calculated as \(pP=\frac{a}{a+c}\), where a is the number of pairs of citations in an empirical cluster that are (correctly) associated with the same author, and c is the number of pairs of citations in an empirical cluster not corresponding to the same author. Pairwise recall (pR) is calculated as \(pR=\frac{a}{a+b}\), where b is the number of pairs of citations associated with the same author that are not in the same empirical cluster. Thus, the pairwise F1 metric is defined as:

$$\begin{aligned} p\text{ F1 }=2\cdot \frac{pP\times pR}{pP+pR} \end{aligned}$$
(9)
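As a reference, the sketch below computes both metrics from a list of empirical clusters and a list of theoretical clusters over the same citations; the list-of-clusters representation is an assumption of this sketch.

```python
# K (geometric mean of ACP and AAP) and pairwise F1, as defined in Eqs. 6-9.
from collections import Counter
from itertools import combinations

def k_metric(empirical, theoretical):
    # empirical/theoretical: lists of clusters, each cluster a list of citation ids
    labels = {c: j for j, cluster in enumerate(theoretical) for c in cluster}
    N = sum(len(cluster) for cluster in empirical)
    sizes = Counter(labels.values())                      # n_j per theoretical cluster
    acp = aap = 0.0
    for cluster in empirical:
        for author, n_ij in Counter(labels[c] for c in cluster).items():
            acp += n_ij ** 2 / len(cluster)
            aap += n_ij ** 2 / sizes[author]
    return ((acp / N) * (aap / N)) ** 0.5

def pairwise_f1(empirical, theoretical):
    labels = {c: j for j, cluster in enumerate(theoretical) for c in cluster}
    same_pred = {frozenset(p) for cluster in empirical for p in combinations(cluster, 2)}
    same_true = {frozenset(p) for cluster in theoretical for p in combinations(cluster, 2)}
    a = len(same_pred & same_true)                        # correctly co-clustered pairs
    pP = a / len(same_pred) if same_pred else 0.0
    pR = a / len(same_true) if same_true else 0.0
    return 2 * pP * pR / (pP + pR) if pP + pR else 0.0
```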

4.4 Experimental setup

Experiments were conducted within each ambiguous group, split into training (50 %) and test (50 %) sets. We first compare NC with supervised author assignment methods using the training and test sets, and then we compare the results obtained using only the test sets with unsupervised (author grouping) methods. All results shown correspond to the performance in the test sets and are the average of 10 runs. All reported differences were found to be statistically significant at the 95 % confidence level when tested with the two-tailed paired t-test with Holm–Bonferroni correction [13].

Unless otherwise stated, RBF kernels were used for SVM and we employed the LibSVM GRID tool for finding its optimal parameters on the training data of each ambiguous group. We estimated the parameters of the NB method as in [10]. Finally, for SLAND, the state-of-the-art supervised author assignment method, the best parameters were found using cross-validation in the training set.

4.5 Results

Table 2 shows the average results of NC compared with Cosine, SLAND, SVM, and NB in DBLP. NC outperforms all baselines under both metrics. Gains under K are around 18.3 and 29.3 % compared with SVM and NB, respectively, and, under pF1, gains surpass 30 % against the same baselines. Compared with Cosine and SLAND, gains are smaller (around 4 % under K against both baselines) but still significant. The weaker performance of SVM and NB is due to the fact that both rely on static training sets and exploit generic classification techniques not directly adapted to the problem.

Table 2 Results obtained in DBLP collection

In KISTI (Table 3), NC also outperforms all baselines under both metrics. Under K, gains range from 1.3 % against SLAND to 22.3 % against NB; under pF1, they range from 1.2 to 36.9 % over the same baselines.

Table 3 Results obtained in KISTI collection

Our best gains are obtained in BDBComp (Table 4), the hardest collection, as can be verified by the much lower pF1 results obtained by all methods. Against Cosine, gains are around 21.9 and 58.3 % under K and pF1, respectively. Against SLAND, the best baseline in this collection, gains range from 4 % under K to 23 % under pF1. Finally, against SVM and NB, gains surpass 70.0 % under K and 157.4 % under pF1. We notice that this collection has a lot of authors with just a few citations; thus, several authors do not have any example in the training data. NC and SLAND are able to identify citations belonging to new authors (i.e., authors without any citation in the training data) and add such citations to the training data, justifying their superior performance.

Table 4 Results obtained in BDBComp collection

Tables 5 and 6 show, for every single ambiguous group in DBLP and BDBComp, the results obtained by our proposed method and the two baselines with the best average performance: SLAND and Cosine. Similar results are shown in Table 7 for the 40 largest ambiguous groups of KISTI (space reasons prevent us from showing results for all 881 groups).

Table 5 Results obtained by SLAND, Cosine and NC in each ambiguous group in the DBLP collection
Table 6 Results obtained by SLAND, Cosine and NC in each ambiguous group in the BDBComp collection
Table 7 Results obtained by SLAND, Cosine and NC in the 40 largest ambiguous groups in the KISTI collection

We can notice in Table 5 that NC produces the best results in all ambiguous groups in DBLP, with four statistical ties with SLAND and five with Cosine under K. NC reaches gains of up to 14.8 % and 11 % (“J. Martin” ambiguous group) against SLAND and Cosine, respectively, under K. In terms of pF1, NC outperforms or is statistically tied with SLAND and Cosine in all ambiguous groups with gains reaching 22.0 % against SLAND and 10.7 % against Cosine (both in “J. Martin”).

Likewise, in BDBComp (Table 6) NC is not outperformed by any method in any single group, with six ties with SLAND under K and seven ties under pF1. Compared with Cosine, NC outperforms it under both metrics in all ambiguous groups but one under K and five under pF1. Gains under K reach up to 11.1 % against SLAND (“L. Silva”) and 45.9 % against Cosine (“R. Santos”). Under pF1, gains reach up to 53.3 % against SLAND and 729.4 % against Cosine (“F. Silva”).

Table 7 shows the results of SLAND, Cosine and NC for the 40 largest ambiguous groups in the KISTI collection. Under K, NC obtains gains of up to 15.8 % (“S. Lee”) against SLAND and 18.6 % (“J. Halpern”) against Cosine. In terms of pF1, NC obtains gains of up to 34.4 % and 18.8 % (“J. Kim”) against SLAND and Cosine, respectively. Notice that, among these 40 largest ambiguous groups, the only one in which NC does not obtain the best performance under both metrics is the “M. Vardi” group. The best result obtained by SLAND in this group is explained by its number of authors and citation records. The “M. Vardi” group has only two authors, one of them with 174 citation records and the other with only two. Thus, the number of rules generated by SLAND for the most prolific author is much higher than that for the other author; in this case, SLAND always assigns the records to the author with the largest number of citations. It is worth noticing, however, that NC performs only 1.8 % below SLAND under K in this group.

4.5.1 Error analysis

Here, we present an analysis of some errors detected during our experiments, based on the values of the similarity function (Eq. 2) and of the confidence metric (Eq. 3) defined for our method. By using these values, we can identify very ambiguous references as well as how the lack of information in the training set affects the disambiguation process. Given a test citation \(c_k\), the cluster \(a_l\) with the highest similarity value among those representing its correct author, and the cluster \(a_z\) with the overall highest similarity value, we have defined the following error cases:

  1. A new cluster is incorrectly created when \(\mathrm{sim}(c_k, a_l) < 0.05\). These errors may be considered inevitable due to the lack of evidence in the training data.

  2. A new cluster is incorrectly created when \(\mathrm{sim}(c_k, a_l) \ge 0.05\). Some of these errors could be avoided with a change in the \(\gamma \) value (the “new author” parameter).

  3. Wrong assignment of \(c_k\) to a cluster due to noise in the training data, that is, if all citations in \(a_z\) that belong to the same author as \(c_k\) were removed from \(a_z\), the citation would be correctly assigned to \(a_l\).

  4. Wrong assignment of \(c_k\) to a cluster when, for every attribute \(x, f(c_k^x,a_z) \ge f(c_k^x,a_l)\). In this situation, no matter the values of the attribute weights, the citation cannot be correctly assigned to \(a_l\).

  5. Wrong assignment with high \(\Delta (c_k)\) values. Since the confidence metric combines information about how close \(c_k\) is to \(a_l\) and how distant it is from \(a_z\), an error with this metric greater than 0.5 indicates a very ambiguous case (or noise in the training data). Some of the errors associated with low values of this metric could be avoided with changes in the attribute weights.

  6. Wrong assignment due to a wrong merge of clusters.

  7. Wrong merge of clusters due to noise in the training data, that is, if the clusters were pure the cosine similarity would not be higher than \(\phi \).

Table 8 shows, for each collection, the percentage of errors in each case described above. As we can see, more than 56 % of all errors can be considered inevitable due to the lack of information in the training data or due to the high ambiguity of the citations (cases 1, 4 and 5). About 40 % of all errors could be avoided by changing the parameter values, but we do not know how such changes would affect the correct classifications (cases 2 and 8). In BDBComp, the high number of errors in case 2 reflects the lack of information in the training data due to the low number of citations per author. In DBLP, the high percentage of errors in cases 1 and 8 may be explained by the lack of discriminative terms in the training data (most terms occur in more than one cluster). In KISTI, more than 50 % of the errors fall into case 4, which indicates that it would be difficult to obtain further improvements using only a similarity function based on bags of words. It is also interesting to notice that only a small fraction of the errors is due to noise in the training set (0.95 %).

Table 8 Error analysis

4.5.2 Runtime analysis

To demonstrate the high efficiency of our method, we empirically measured the runtime of NC and the baselines. All experiments were performed on a Linux-based PC (Linux Mint 16 Cinnamon, 64 bits) with an Intel Core i5-3570 CPU (3.4 GHz, 4 cores) and 8 GB of RAM. Most methods were implemented in Java, namely NC, Cosine, SVM, NB, and HHC; SVM also uses training and classification tools implemented in C. SLAND and LASVM are also implemented in C.

Table 9 shows the average runtime in seconds of all methods in the whole disambiguation task (i.e., the time for training and classification). As we can see, NB is the fastest method, while ours obtained the second best performance in all collections. However, NC is at least 25 % more effective than NB in these same collections. The runtime of NC is statistically tied with that of Cosine, but the effectiveness gains of NC over this method range from 4 % (DBLP) to 58 % (BDBComp). Compared with the other baselines (SVM and SLAND), our method is orders of magnitude faster. The SVM runtime is around 5,900, 4,500 and 400 times higher than the NC runtime in the BDBComp, DBLP and KISTI collections, respectively, though most of this time is spent on training (i.e., parametrization). Regarding SLAND, the best baseline, its runtime is around 2,700, 10,800 and 6,000 times higher than the NC runtime.

Table 9 Runtime analysis

Recall that SLAND is a lazy method with no a priori training, so most of this time is in fact real-time parameter fitting and classification time. The low performance of SLAND can be explained by its computational complexity. The number of instructions executed by SLAND is proportional to the number of rules generated during disambiguation. Given a test citation \(c_k\), the maximal number of association rules that can be induced per training instance is \(2^{|c_k|}\). The procedure used to abstain from unreliable predictions can, in the worst case, double the number of times that each citation is classified. Thus, the disambiguation time of SLAND can be represented by \(O(|D| 2^{|c_k|})\), which is proportional to the number of training examples and exponential in the citation size, for each (test) citation to be disambiguated. This explains the high cost of this method.

4.5.3 Analyzing the components of our solution

To analyze the impact of each component of our solution on the performance of the proposed method, we performed experiments removing some of its capabilities. We evaluate four scenarios: (1) NC-No-Training-Update—the method does not add new examples to the training set, but keeps the capability of identifying new authors and inserting them into the training set; (2) NC-Cosine—the algorithm remains the same, but our own similarity function is replaced by the cosine similarity with terms weighted by TF-IDF; (3) NC-Similarity-Only—the method neither improves the coverage of the training set nor identifies new authors, i.e., it uses only the similarity function in the AND task; and (4) NC-No-New-Authors—the method does not identify new authors.
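As a purely illustrative summary of these scenarios, they can be thought of as configuration switches over the full method; the class and field names below are hypothetical, not the actual NC implementation.

```java
// Illustrative ablation switches for the four scenarios above; names are
// hypothetical and do not correspond to the actual NC code.
public class NCAblationConfig {
    boolean updateTrainingSet = true; // add reliable predictions to the training set
    boolean useOwnSimilarity  = true; // false: fall back to TF-IDF + cosine (NC-Cosine)
    boolean detectNewAuthors  = true; // false: always assign to the nearest existing cluster

    public static NCAblationConfig noTrainingUpdate() { // NC-No-Training-Update
        NCAblationConfig c = new NCAblationConfig();
        c.updateTrainingSet = false;
        return c;
    }

    public static NCAblationConfig similarityOnly() {   // NC-Similarity-Only
        NCAblationConfig c = new NCAblationConfig();
        c.updateTrainingSet = false;
        c.detectNewAuthors = false;
        return c;
    }

    public static NCAblationConfig noNewAuthors() {      // NC-No-New-Authors
        NCAblationConfig c = new NCAblationConfig();
        c.detectNewAuthors = false;
        return c;
    }
}
```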

In BDBComp (Table 10) and KISTI (Table 11), effectiveness is not hurt only when we do not add new examples to the training data. In the other scenarios there is some drop in effectiveness, as already mentioned. BDBComp and KISTI have several authors with only a few citations, so many authors have no citation example in the training set. This characteristic leads to effectiveness losses when new authors are not identified (NC-No-New-Authors) and, consequently, when the method uses only our similarity function (NC-Similarity-Only). In such situations, the performance decreases by around 18 and 36 % under the K and pF1 metrics, respectively, in BDBComp, and by around 5 and 6.7 % in KISTI for the same metrics. These results show why identifying new authors is important in scenarios in which new researchers are frequently inserted into a digital library.

Table 10 Results obtained in BDBComp collection
Table 11 Results obtained in KISTI collection

In DBLP (Table 12), the most impactful modifications are the joint removal of the capability of improving the coverage of the training set and of the capability of identifying new authors, as well as the replacement of our similarity function. DBLP has many training examples for each author in each ambiguous group, so removing each of the first two capabilities in isolation has little impact, but removing them together causes a drop in effectiveness.

Table 12 Results obtained in DBLP collection

To sum up, we can say that, in the tested datasets, all components of our solution are, in one way or another, important for improving effectiveness, with the least important component being the addition of new training examples.

4.5.4 Effectiveness without any training

To analyze the effectiveness of our method without any training example, we ran NC with default parameters (i.e., \(w_c=0.5, w_t=0.3, w_v=0.2, \alpha =0.2, \delta =0, \gamma =0.2\) and \(\phi =0.1\)) and with the self-training strategy described in Sect. 3.3.5 (NC-Self-Training). We defined this standard configuration as follows: (1) for the attribute weights, we used the values suggested in [18]; (2) for the other parameters, we used the values, common to all collections, that maximized effectiveness under cross-validation on the training set.
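For concreteness, this default configuration can be written down as a simple parameter object. The field names below are illustrative only, and the comments reflect the parameter roles suggested by the guidelines in Sect. 4.5.5 rather than the actual implementation.

```java
// Default NC parameter values used in the no-training experiments.
// Field names and comments are illustrative, not the actual NC identifiers.
public class NCDefaultParams {
    double wCoauthor = 0.5; // w_c: coauthor attribute weight (kept highest)
    double wTitle    = 0.3; // w_t: presumably the work-title attribute weight
    double wVenue    = 0.2; // w_v: presumably the venue-title attribute weight
    double alpha     = 0.2; // alpha: limits large variations in term weights
    double delta     = 0.0; // delta: kept at (or near) 0 to use the available test information
    double gamma     = 0.2; // gamma: controls purity vs. fragmentation of the clusters
    double phi       = 0.1; // phi: cluster-merging (cosine similarity) threshold
}
```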

Note that, in the experiments with no training data, besides having no information about the best parameters, NC also has no information about the authors behind a given ambiguous name. The groups of citations representing these authors have to be automatically discovered using NC’s incremental disambiguation capabilities (i.e., training data expansion and fusion of groups). In this context, a comparison with NB, SVM and Cosine is not possible, since they need citation examples for each author to perform the assignments. Thus, in this section we compare our results with those obtained by the methods that can operate with no manually labeled training example, including the author grouping methods HHC and LASVM-DBSCAN. Notice that SAND uses SLAND in its disambiguation task; thus, we do not include SLAND in this discussion.

The results obtained using the test set are shown in Tables 13, 14, and 15 for DBLP, KISTI and BDBComp, respectively. Considering the results obtained by NC with default parameters, in the DBLP and KISTI collections the effectiveness decreases by at most 13.3 and 23 % under K and pF1, respectively. Notice that, in these collections, NC is still better than SVM and NB, which use 50 % of the training data and know the structure of the disambiguation space a priori. In the BDBComp collection, the decrease is even smaller: around 0.9 and 6.0 % under K and pF1, respectively. In fact, with no training data, NC still outperforms all author assignment baselines in BDBComp.

Table 13 Comparison of NC applied without training set with the author grouping methods in DBLP
Table 14 Comparison of NC applied without training set with the author grouping methods in KISTI
Table 15 Comparison of NC applied without training set with the author grouping methods in BDBComp

Comparing the results with the methods that need no manually labeled example, we notice that NC and NC-Self-Training are statistically tied with SAND, the best baseline, in the BDBComp collection. In the DBLP and KISTI collections, NC-Self-Training outperforms SAND under the K metric, with gains of around 11.2 and 3.9 %, respectively. The results also show the effectiveness of the self-training procedure in BDBComp, where NC-Self-Training achieves a gain of about 7.1 % under the pF1 metric compared with the manually configured NC, while almost tying with it in KISTI under both metrics. Only in DBLP are there more substantial losses, due to the high ambiguity of the “C. Chen”, “J. Lee”, “S. Lee” and “Y. Chen” groups.

4.5.5 Sensitivity to parameters \(\gamma \) and \(\phi \)

When applying our method in an unsupervised manner, it is important to carefully adjust the parameters \(\gamma \) and \(\phi \) in order to balance the degree of purity and fragmentation of the generated clusters. Figure 1 shows how the K metric varies as a function of these values in the interval from 0 to 0.5, keeping the other parameter values constant \((w_c=0.5, w_t=0.3, w_v=0.2, \alpha =0.2\) and \(\delta =0)\). We notice that values lower than 0.1, for both parameters in all collections, generate clusters with a low purity rate and, therefore, a lower K value. In DBLP and KISTI, the best results are obtained in the range from 0.1 to 0.2 for both parameters. In BDBComp, for the \(\gamma \) parameter the best results are observed between 0.2 and 0.3, while for \(\phi \) there are no significant changes for values higher than 0.05, due to the small number of citations per author in this collection.

Fig. 1 Sensitivity to parameters \(\gamma \) and \(\phi \): a DBLP, b KISTI, c BDBComp
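The sweep behind Fig. 1 amounts to a simple grid search over \(\gamma \) and \(\phi \) with the remaining parameters fixed. A minimal sketch is given below; the Evaluator interface is a hypothetical stand-in for running NC on a collection and computing the K metric.

```java
// Sketch of the sensitivity sweep: vary gamma and phi from 0 to 0.5 in steps
// of 0.05 while keeping the remaining parameters fixed.
public class SensitivitySweep {
    public interface Evaluator {
        // Hypothetical hook: runs NC with the given thresholds and returns K.
        double kMetric(double gamma, double phi);
    }

    public static void sweep(Evaluator eval) {
        for (int i = 0; i <= 10; i++) {
            for (int j = 0; j <= 10; j++) {
                double gamma = i * 0.05;
                double phi = j * 0.05;
                System.out.printf("gamma=%.2f phi=%.2f K=%.4f%n",
                                  gamma, phi, eval.kMetric(gamma, phi));
            }
        }
    }
}
```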

To set the other parameters, we suggest the following guidelines: (1) the coauthor attribute weight must be higher than the other weights (configurations such as \(w_c=0.5, w_t=0.3, w_v=0.2\) or \(w_c=0.4, w_t=0.3, w_v=0.3\) are usually good choices); (2) the \(\alpha \) value should be between 0.2 and 0.5 in order to avoid large variations in the term weights due to the lack of information in the training set; and (3) the value of \(\Delta _{min}\) should be close to or equal to 0 so that the method can use the maximum information available in the test set.
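Purely as an illustration of guideline (1), and assuming that the per-attribute similarity scores are combined linearly with these weights (which may differ from NC’s actual formulation), the suggested weighting could be encoded as follows.

```java
// Illustrative weighted combination of per-attribute similarities; assumes each
// score lies in [0, 1]. This is a sketch of guideline (1), not NC's exact formula.
public class AttributeWeighting {
    public static double combined(double simCoauthor, double simTitle, double simVenue) {
        final double wC = 0.5, wT = 0.3, wV = 0.2; // coauthor weight dominates
        return wC * simCoauthor + wT * simTitle + wV * simVenue;
    }
}
```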

4.5.6 Qualitative overview of the methods

Table 16 summarizes some characteristics of each evaluated method based on the results previously discussed. NC is the only method able to detect fragmented groups, i.e., distinct groups that contain citations of a single author. Furthermore, it requires no training data, has low time complexity and is highly efficient. SAND and SLAND have high effectiveness but low efficiency, which hampers their use in large datasets. NC, SAND and SLAND are capable of detecting new authors, i.e., they are able to infer whether or not a citation belongs to an author already in the disambiguated repository. HHC, LASVM-DBSCAN and SVM have medium effectiveness, but SVM has a major drawback: it requires training data associated with each author.

Table 16 Qualitative overview of the methods

Considering the sensitivity to parameter values, we have observed that all methods except SAND and NB, which have no parameters, can vary considerably in performance if not properly configured. However, as shown in Sect. 4.5.4, NC can be suitably configured automatically by using a self-training strategy similar to that of SAND.

5 Conclusions and future work

We have proposed a highly effective, efficient and practical method that smoothly combines several domain-specific heuristics to solve the AND task. Our experiments demonstrate the superiority of NC in a rich set of scenarios. In fact, NC produces some of the best results ever reported in the literature using a minimal set of features, with very low computational costs, while presenting most of the properties that an “ideal” AND method should possess.

As future work, we intend to exploit NC in incremental tasks, to improve its effectiveness by considering the co-occurrence of terms within attributes, and to exploit relevance feedback in the method. We also intend to investigate techniques to detect noisy citations within clusters so that they can be discarded by the similarity measure and, since the training data change during the disambiguation process, to study how the parameter values can be adapted to such changes.