
1 Introduction

Name disambiguation in digital libraries refers to the task of attributing publications or citation records to the proper authors [15, 24]. It is common for several authors to share the same name, or for a single author to appear under multiple names in a digital library. This is inconvenient for users, who cannot easily determine which records belong to which author. For example, Han et al. [7] found that the author page of “Yu Chen” contains citations authored by three individuals with the same name. Zhu et al. [25] likewise found at least 50 different authors named “Wei Wang” in DBLP, with more than 400 entries under this name. On the other hand, different authors may share the same name label across papers, e.g., both “Jia Zhu” and “Jiaxing Zhu” appear as “J. Zhu” in their papers. In recent years, name disambiguation has become a major challenge when integrating data from multiple sources in bibliographic digital libraries.

At present, disambiguation of homonymous names has received growing attention with the advent of the semantic web and social networks. Successful name entity disambiguation can greatly help in locating the right researcher, obtaining his/her academic information from the correct homepage, and indexing bibliographic databases more accurately and efficiently. Name disambiguation also underpins applications such as search engines, social networks, credit evaluation, and conflict-of-interest detection. Table 1 lists five records with authors’ names and paper titles. From these records alone, it is difficult to be sure whether the author “Wen Gao” is the same person in each. Beyond the problem of different people sharing the same name, name abbreviations and other reference variations compound the challenge of name disambiguation. This paper addresses the problem of different people sharing the same name, using coauthors and paper titles.

Table 1. An example of name disambiguation

Many existing approaches address the problem of name ambiguity. For example, Yang et al. [19] proposed two kinds of correlations between citations, namely Topic Correlation and Web Correlation, to exploit relationships between records. However, they only considered Web pages edited by humans to measure Web correlation, without using other existing digital libraries. Tang et al. [18] formalized the problem in a unified framework and proposed a generalized probabilistic model to solve it. In their problem definition, they assigned six attributes to each paper. Unfortunately, this may not always be practical for all digital libraries, because we cannot obtain all of that information directly.

For disambiguation features, previous works have employed coauthors, titles of articles/publications, topics of articles, and years of publication, which constitute basic citation data. Titles of articles may epitomize the research areas of their authors; thus, under the assumption that namesakes do not heavily share research areas, title similarity between articles may be employed in resolving homonymous author names. The same applies to titles of books. In addition to these record-internal features, some works have utilized record-external features such as abstracts, self-records, and citation URLs. When the full text of articles is available, additional features such as e-mail addresses, affiliations, and keywords can be extracted and applied to author name disambiguation, but the full text is unavailable in many digital libraries. In this study, we choose coauthors and paper titles as the disambiguation features.

In this paper, to reduce errors in the process of hierarchical agglomerative clustering [17, 21] over a list of citation records, we introduce the concept of ranking confidence to measure the confidence of different similarity measurements. The ranking confidence decides which similarity measure to use when merging clusters. We employ three similarity measures, namely Jaccard similarity [12, 14], cosine similarity [12, 13], and Euclidean distance, to compute the similarity between two clusters. During clustering, we adopt a pair-wise grouping algorithm that groups records into clusters by repeatedly merging the most similar pairs of clusters. In the experiments, we use PairPrecision, PairRecall, and PairF1 scores to evaluate our results and compare them with other methods.

The rest of this paper is organized as follows. In Sect. 2, we discuss related work on name disambiguation. In Sect. 3, we describe the details of our approach, including string similarity, ranking confidence, and the clustering procedure. In Sect. 4, we describe our experiments, evaluation methods, and result analysis, and compare our approach with other methods. We conclude and discuss this study in Sect. 5.

2 Related Work

This section discusses recent and prior work on name disambiguation. A great deal of research has focused on the name disambiguation problem across different types of data sets, adopting unsupervised, semi-supervised, or supervised learning approaches. Better disambiguation results are useful for evaluating faculty publications and for computing social-network statistics and author-impact measures.

There are many existing methods for name disambiguation. Kang et al. [8] explored the net effects of coauthorship on author clustering in bibliographic data. They proposed a web-assisted technique for acquiring implicit coauthors of the target author to be disambiguated, on the premise that the identity of an author can be determined by his/her coauthors. However, the study was best suited to Korean, and their method cannot identify which pages are personal pages. Han et al. [7] proposed an unsupervised learning approach using K-way spectral clustering to disambiguate authors in citations. However, this clustering method may not work well when the dataset contains many ambiguous authors, and it is unsuitable for large-scale digital libraries since K is not known a priori for an ever-growing digital library.

Lei et al. [3] presented new research on entity disambiguation focused on name disambiguation in digital libraries, adopting pairwise similarity and a novel hierarchical agglomerative clustering approach with an adaptive stopping criterion. Zhu et al. [23] proposed an approach that effectively identifies and retrieves information from web pages and uses that information to disambiguate authors. They also implemented a web-page identification model using a neural network classifier and traffic rank. However, the content of web pages is disorganized, and extracting useful information is difficult, especially when information about the author is scarce. Mann et al. [11] presented a set of algorithms for distinguishing personal names with multiple real referents in text, based on little or no supervision. Their approach applied an unsupervised clustering technique over a rich feature space of biographic facts.

Lin et al. [16] proposed a supervised method exploiting various side information, including coauthors, organization, paper citations, title similarity, authors’ homepages, web constraints, and user feedback. Although user feedback adds useful information, it is very difficult to collect when the amount of data is large, and collecting it consumes considerable manpower and material resources. Li et al. [9] presented a novel categorical set similarity measure named CSLR for two sets that both follow categorical distributions; it is applied in author name disambiguation to measure the similarity between two venue sets or coauthor sets. However, a single kind of similarity cannot do justice to the different aspects that similarity should emphasize.

Yang et al. [19] proposed two kinds of correlations between citations, namely Topic Correlation and Web Correlation, to exploit relationships between citations in order to identify whether two citations with the same author name refer to the same individual. However, they extracted citation topics only from venue information to discover topic-based relationships, which may not be representative. Yin et al. [20] developed a general object-distinction methodology called DISTINCT, which combines two complementary measures of relational similarity: set resemblance of neighbor tuples and random walk probability. Although this method is accurate, it relies heavily on the quality of the training data, which is difficult to obtain. Moreover, the names of proceedings and conferences carry a certain ambiguity of their own, which may affect the similarity computation.

In selecting attributes from data sets, many papers choose a variety of attributes. For example, Tang et al. [18] assigned six attributes to each paper: the title, the venue, the publication year, the abstract, the authors, and the references. Han et al. [7] used three types of attributes to design features for name disambiguation: coauthor names, paper titles, and publication venue titles. Han et al. [5] preprocessed their datasets on author names, paper title words, and journal title words. Arif et al. [1] chose attributes including the title of the citation, its authors, the author’s e-mail, the affiliation of the publication, and the year of publication. In this paper, we choose only two attributes for analysis: coauthors and the paper title.

3 Proposed Approach

3.1 String Similarity Metrics

In this section, we use three methods, Jaccard similarity, cosine similarity, and Euclidean distance, to calculate the similarity of two paper titles; we then use ranking confidence to determine which similarity values should be used. Our approach is based on Hierarchical Agglomerative Clustering (HAC) and can effectively identify ambiguous authors.

3.1.1 Jaccard Similarity

Among the many possible token-based similarity measures, we use Jaccard similarity [12, 14] to calculate the similarity of two paper titles. The Jaccard similarity coefficient is an index measuring the similarity of two sets: the larger the value, the more similar the two titles. We briefly describe the metric below. Using the terms of Table 2, the Jaccard similarity function is defined as follows:

$$\begin{aligned} Jaccard = \frac{|T_{x}\bigcap T_{y}|}{|T_{x}\bigcup T_{y}|} \end{aligned}$$
(1)
Table 2. Terms in the set of paper title
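
As a concrete illustration, the following is a minimal Python sketch of Eq. (1), treating each title as a set of lowercase word tokens; the whitespace tokenization is an assumption, since the paper does not specify one.

```python
# Hedged sketch of Eq. (1): Jaccard similarity between two paper titles,
# assuming simple whitespace tokenization of lowercased titles.

def jaccard_similarity(title_x: str, title_y: str) -> float:
    tx = set(title_x.lower().split())  # T_x: token set of the first title
    ty = set(title_y.lower().split())  # T_y: token set of the second title
    if not tx and not ty:
        return 0.0
    return len(tx & ty) / len(tx | ty)

# Example: two titles sharing the words "deep" and "learning".
print(jaccard_similarity(
    "Why Does Unsupervised Pre-training Help Deep Learning",
    "Deep Learning in Neural Networks: An Overview"))
```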

3.1.2 Cosine Similarity

In this approach, we use vector similarities to measure the similarity of paper titles. The cosine similarity [12, 13] between two vectors (here, two paper titles in a vector space) is the cosine of the angle between them. This metric measures orientation rather than magnitude: it can be seen as comparing paper titles in a normalized space, because we take into consideration not the magnitude of each title’s word-count (TF-IDF) vector (with stop words removed), but the angle between the title vectors of the records.

For example, suppose we have two citation titles: ‘Why Does Unsupervised Pre-training Help Deep Learning’ and ‘Deep Learning in Neural Networks: An Overview’. Their combined vocabulary is why, does, unsupervised pre-training, help, deep learning, in, neural networks, an, overview; after removing stop words, it becomes unsupervised pre-training, help, deep learning, neural networks, overview. Finally, we obtain the vectors \(V(title)_{1} = [1, 1, 1, 0, 0]\) and \(V(title)_{2} = [0, 0, 1, 1, 1]\). We use the simple cosine similarity, the angle between the two vectors, defined as:

$$\begin{aligned} cos\theta = \frac{\overrightarrow{V(title)_{1}} \cdot \overrightarrow{V(title)_{2}}}{||V(title)_{1}|| \times ||V(title)_{2}||} \end{aligned}$$
(2)
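
A minimal sketch of Eq. (2) on the example vectors above, using raw term counts (the TF-IDF weighting mentioned earlier would simply replace the vector entries):

```python
# Hedged sketch of Eq. (2): cosine similarity between two title vectors.
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

v_title1 = [1, 1, 1, 0, 0]  # vector for the first example title
v_title2 = [0, 0, 1, 1, 1]  # vector for the second example title
print(cosine_similarity(v_title1, v_title2))  # 1/3: only "deep learning" is shared
```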

3.1.3 Euclidean Distance

The Euclidean distance between points x and y is the length of the line segment connecting them (\({\overline{{\mathbf {x}} {\mathbf {y}} }}\)). We first calculate the Euclidean distance between the two title vectors and then derive the similarity between the two paper titles from it: the farther apart they are, the greater the difference between the titles. The distance function is defined as follows.

$$\begin{aligned} dist(X,Y) = \sqrt{\sum _{i=1}^n (x_{i}-y_{i})^2} \end{aligned}$$
(3)
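
A minimal sketch of Eq. (3) on the same example vectors; since the paper does not specify how distance is converted into a similarity score, the common transform \(1/(1+dist)\) is shown purely as an assumption:

```python
# Hedged sketch of Eq. (3): Euclidean distance between two title vectors,
# plus one assumed distance-to-similarity transform.
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

dist = euclidean_distance([1, 1, 1, 0, 0], [0, 0, 1, 1, 1])
print(dist)            # 2.0 for the example title vectors
print(1 / (1 + dist))  # assumed transform: smaller distance -> higher similarity
```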

3.2 Ranking Confidence

To reduce the risk of relying on a single similarity metric, we use ranking confidence to determine which metric should be used for clustering. For example, suppose a cluster R contains three citation records \(r_{1}\), \(r_{2}\), \(r_{3}\), and we want to know whether a citation record q can be merged into R; we must then calculate the similarity between q and each record in R. Suppose the similarity between \(r_{1}\) and q is ranked lower than that of all other records under Jaccard similarity but higher than all other records under cosine similarity; the correct ranking is then unclear. Therefore, we use ranking confidence to determine which similarity values should be used.

After calculating the three similarities between two paper titles, we apply an algorithm to compute the ranking confidence. The confidence of the three similarity rankings should be highly related to the distribution of the similarity values [2]. The functions are defined in Eqs. (4), (5), and (6):

$$\begin{aligned} d_{(r,q,f,k)} = {|sim_{f}(r,q,k)-sim_{f}(r,q,t)|} \quad (k=1, 2, \ldots , n), \end{aligned}$$
(4)
$$\begin{aligned} C_{(r,q,f,k)} = \frac{1-d_{(r,q,f,k)}}{\sum _{i=1}^n \left( 1-d_{(r,q,f,i)}\right) } \quad (k=1, 2, \ldots , n), \end{aligned}$$
(5)
$$\begin{aligned} C_{(r,q,f)final} = \frac{1}{n}\sum _{k=1}^n C_{(r,q,f,k)}. \end{aligned}$$
(6)

Here \(C_{(r,q,f,k)}\) is the confidence of similarity method f in the ranking of cluster r. We model \(C_{(r,q,f,k)}\) as the probability that cluster r is ranked at position k in the final ranking given its similarity value sim\(_{f}(r,q,k)\), where sim\(_{f}(r,q,k)\) is the similarity value under method f. The term \(d_{(r,q,f,k)}\) computes the difference between \(sim_{f}(r,q,k)\) and \(sim_{f}(r,q,t)\) and uses it to model the required confidence value, where \(sim_{f}(r,q,t)\) denotes the top value among the \(sim_{f}(r,q,k)\) \((k = 1, 2, \ldots , n)\). In Eq. (6), \(C_{(r,q,f)final}\) is the final confidence value of method f, obtained by averaging.

Fig. 1. The process of calculating the value of confidence

Figure 1 shows the process of calculating the confidence values. In this process, we calculate the confidence value between a citation record and every citation record of cluster1. S(1, J), S(1, C), and S(1, E) denote the similarity between the citation’s title and title1 of cluster1 under Jaccard similarity, cosine similarity, and Euclidean distance, respectively, and C(1, J), C(1, C), and C(1, E) denote the corresponding confidence values, computed by Eqs. (4) and (5). Finally, we obtain the final confidence values \(C_J\_final\), \(C_C\_final\), and \(C_E\_final\) through Eq. (6).

For example, suppose we have a citation record Q and a cluster R containing three citation records \(R_{1}\), \(R_{2}\), \(R_{3}\). We first calculate the Jaccard similarity between Q and each of \(R_{1}\), \(R_{2}\), \(R_{3}\), obtaining the values \(S_{1}\), \(S_{2}\), \(S_{3}\), and then the three corresponding confidence values \(C_1\), \(C_2\), \(C_3\) by Eqs. (4) and (5). By Eq. (6), we obtain \(C_J\_final\) as the final confidence for Jaccard similarity, to be compared with the other methods’ values. Suppose \(C_J\_final\) is the highest among \(C_J\_final\), \(C_C\_final\), and \(C_E\_final\); we then use the Jaccard similarity value as the final value of cluster1 when comparing against other clusters’ similarity values.
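
The following Python sketch mirrors this example. Since Eqs. (4)–(6) leave the exact normalization open, the sketch assumes the per-record confidences are normalized across the three methods (so the per-method finals sum to one and can be compared) and that all similarity values lie in [0, 1]; every number below is hypothetical.

```python
# Hedged sketch of the ranking-confidence idea in Eqs. (4)-(6).

def ranking_confidence(sims_by_method):
    """Map each method f to its final confidence C_f_final for merging a
    query record q into a cluster R (one similarity value per record in R)."""
    gaps = {}
    for f, sims in sims_by_method.items():
        top = max(sims)                            # sim_f(r, q, t): the top value
        gaps[f] = [abs(s - top) for s in sims]     # Eq. (4)
    n = len(next(iter(gaps.values())))
    finals = {f: 0.0 for f in gaps}
    for k in range(n):
        total = sum(1 - gaps[f][k] for f in gaps)  # assumed cross-method normalizer
        for f in gaps:
            finals[f] += (1 - gaps[f][k]) / total  # Eq. (5), per record k
    return {f: c / n for f, c in finals.items()}   # Eq. (6): average over records

finals = ranking_confidence({
    "jaccard":   [0.60, 0.55, 0.20],   # hypothetical S_1, S_2, S_3 per method
    "cosine":    [0.70, 0.30, 0.25],
    "euclidean": [0.50, 0.45, 0.40],
})
print(max(finals, key=finals.get), finals)  # pick the most confident method
```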

3.3 Clustering Procedure

In this paper, we employ Hierarchical Agglomerative Clustering (HAC) [17, 21] as the basic framework. It starts with each paper as its own cluster; we then repeatedly find the most similar pair of clusters (under the similarity measures defined in Sect. 3.1) and merge them, until the maximal similarity falls below a certain threshold. In this process, we cluster on two kinds of attributes: coauthors and paper titles. The whole clustering process divides into two stages:

1. Merge based on evidence from shared coauthors, using hierarchical agglomerative clustering;

2. Merge based on the combined similarity defined on the title sets of each pair of clusters.

The reasons for the two-stage design are twofold. First, coauthors often provide stronger evidence than other features, so the clusters they generate usually consist of papers by the same author, although one author’s papers may still be distributed among multiple clusters [4]. Second, the paper title feature is weaker evidence, with which we can further merge clusters belonging to the same author. More importantly, we apply the ranking confidence method to decide whether one cluster can be merged into another.

Our modified agglomerative clustering method is shown in Algorithm 1. The approach is based on Hierarchical Agglomerative Clustering (HAC) and effectively identifies ambiguous authors; the whole clustering process divides into the two stages above. In the first stage, we employ a pair-wise grouping algorithm to group records into clusters based on coauthors’ names. Then we use the three similarity measures, Jaccard similarity, cosine similarity, and Euclidean distance, to compute the similarity between two clusters. To reduce the risk of relying on a single similarity metric, we use ranking confidence to determine which metric should be used for clustering. Once the final metrics are obtained, we check whether at least two of the three similarity values exceed the thresholds we set for each similarity metric. If so, we merge the citation record into the cluster; otherwise we stop the clustering process. These thresholds are tuned during the experiments.

Algorithm 1. Modified agglomerative clustering with ranking confidence
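
As a rough companion to Algorithm 1, the following Python sketch outlines the two-stage procedure under several assumptions: each record is a dict with a "coauthors" set and a "title" string, cluster-to-cluster title similarity is taken as the maximum pairwise value, and the `metrics`/`thresholds` inputs (e.g., the similarity functions sketched in Sect. 3.1) and the two-of-three rule follow the reading given above. The ranking-confidence step of Sect. 3.2 would slot into `title_vote` and is omitted here for brevity.

```python
# Hedged sketch of the two-stage clustering; not the paper's exact Algorithm 1.

def shares_coauthor(c1, c2):
    # Stage-one evidence: some record pair across the clusters shares a coauthor.
    return any(r1["coauthors"] & r2["coauthors"] for r1 in c1 for r2 in c2)

def merge_pass(clusters, can_merge):
    """One agglomerative pass: absorb every cluster that qualifies for merging."""
    i = 0
    while i < len(clusters):
        j = i + 1
        while j < len(clusters):
            if can_merge(clusters[i], clusters[j]):
                clusters[i] += clusters.pop(j)  # merge cluster j into i, rescan
                j = i + 1
            else:
                j += 1
        i += 1
    return clusters

def title_vote(c1, c2, metrics, thresholds):
    # Stage-two test (assumed reading): at least two of the three metrics
    # exceed their per-metric thresholds on the best-matching title pair.
    votes = 0
    for name, sim in metrics.items():
        best = max(sim(r1["title"], r2["title"]) for r1 in c1 for r2 in c2)
        votes += int(best >= thresholds[name])
    return votes >= 2

def disambiguate(records, metrics, thresholds):
    clusters = [[r] for r in records]  # each paper starts as its own cluster
    clusters = merge_pass(clusters, shares_coauthor)  # stage 1: coauthors
    return merge_pass(
        clusters, lambda a, b: title_vote(a, b, metrics, thresholds))  # stage 2
```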

4 Experiments

4.1 Data Sets

In our experiments, we perform evaluations on a dataset constructed by Tang et al. [18], which contains citations collected from the DBLP website; we downloaded the dataset from the authors’ website. Each citation record consists of three basic attributes: title, coauthors, and venue, but we use only two of them in our experiments: paper title and coauthors. We collected records for 110 author names, comprising 1,723 individual authors and 8,505 citation records. Some statistics of this dataset are shown in Table 3. For example, there are 28 persons named “David Brown” and 40 persons named “Lei Chen”.

Table 3. Evaluation dataset

4.2 Evaluation Results

As in [18, 20], we use pairwise precision, pairwise recall, and pairwise F1 score to evaluate the performance of our method and to compare it with previous methods. The pairwise measures evaluate name disambiguation by counting pairs of papers assigned the same label. For example, if two papers carry the same author name and truly belong to the same author, they form one pair with the same label. Specifically, any two papers annotated with the same label in the ground truth are called a correct pair; two papers predicted to have the same label that do not share a label in the ground truth are called a wrong pair. Note that the counting covers only pairs of papers with the same label (either predicted or labeled). Thus, we define the measures in Eqs. (7), (8), and (9):

$$\begin{aligned} PairPrecision = \frac{\#PairsCorrectlyPredictedToSameAuthor}{\#TotalPairsPredictedToSameAuthor} \end{aligned}$$
(7)
$$\begin{aligned} PairRecall = \frac{\#PairsCorrectlyPredictedToSameAuthor}{\#TotalPairsToSameAuthor} \end{aligned}$$
(8)
$$\begin{aligned} PairF1 = \frac{2 \times PairPrecision \times PairRecall}{PairPrecision + PairRecall} \end{aligned}$$
(9)
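
A minimal sketch of Eqs. (7)–(9), where `predicted` and `truth` map each paper id to its predicted and ground-truth author label; the two small label maps below are hypothetical:

```python
# Hedged sketch of pairwise precision/recall/F1 (Eqs. (7)-(9)).
from itertools import combinations

def pairwise_scores(predicted, truth):
    papers = list(predicted)
    pred_pairs = {(a, b) for a, b in combinations(papers, 2)
                  if predicted[a] == predicted[b]}
    true_pairs = {(a, b) for a, b in combinations(papers, 2)
                  if truth[a] == truth[b]}
    correct = len(pred_pairs & true_pairs)  # pairs correctly given the same author
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

predicted = {1: "A", 2: "A", 3: "A", 4: "B"}  # hypothetical cluster labels
truth     = {1: "A", 2: "A", 3: "B", 4: "B"}  # hypothetical ground truth
print(pairwise_scores(predicted, truth))       # (0.333..., 0.5, 0.4)
```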

In this paper, we compare our approach with several baseline methods: Hierarchical Agglomerative Clustering (HAC) [17, 21], K-means [10], and SACluster [22]. HAC uses only Jaccard similarity to measure the similarity between citations, operating on a list of citations with the help of a search engine for the disambiguation task. K-means is a process for partitioning an N-dimensional population into k sets on the basis of a sample; it tends to give partitions that are reasonably efficient in the sense of within-class variance [10]. In the K-means algorithm, each citation is represented by a feature vector, with each coauthor name and each keyword of the paper title as a feature of the vector [6]. SACluster partitions the nodes of a graph into K clusters using both the structural and the attribute information associated with each node [18].

For a fair comparison, we feed the same attribute features used in our method into these baselines. Our approach combines the baseline HAC with ranking confidence for the disambiguation task, using the coauthor and paper-title features. We conducted disambiguation experiments on the papers associated with each author name in our data set. Table 4 shows the results for some examples from our data sets. Our approach clearly outperforms the baseline methods for name disambiguation (+13.15% over HAC, +46.62% over K-means, and +20.18% over SACluster in average F1 score). Our approach also achieves much higher precision than the other methods except HAC, and higher recall than all three. Moreover, the recall and F1 of HAC with ranking confidence (our proposed method) are higher than those of HAC without it. For example, the results for “Sanjay Jain” and “Qiang Shen” show much higher precision, recall, and F1 than HAC; the result for “Charles Smith” indicates that our approach is much better than K-means, and the result for “Thomas D. Taylor” suggests that it is much better than both K-means and SACluster.

Table 4. Results of name disambiguation (Percent)

5 Conclusion and Discussion

Name disambiguation in databases is a non-trivial task because different people can share the same name and one person can have many name variations. Moreover, in most cases only limited information is associated with each name in the database. This paper describes a clustering approach for name disambiguation in DBLP that uses only the coauthor and paper-title information of each record. First, we group records with the same name into clusters according to coauthors; then we merge two clusters if the similarity of their titles reaches a threshold. To reduce the risk of depending on a single similarity algorithm, we propose an algorithm called ranking confidence. In the experiments, we use PairPrecision, PairRecall, and PairF1 scores to evaluate our method and compare it with other methods. The experimental results show that the approach effectively differentiates authors with the same name and generates better results than the baseline methods HAC, K-means, and SACluster while using only two attributes: coauthors and paper titles. Better disambiguation results are useful for evaluating faculty publications and computing social-network statistics and author-impact measures. In future work, we will pay more attention to the method for merging different clusters.