A Novel Approach for Author Name Disambiguation Using Ranking Confidence

Lin, Xueqin; Zhu, Jia; Tang, Yong; Yang, Fen; Peng, Bo; Li, Weiling

doi:10.1007/978-3-319-55705-2_13

Xueqin Lin¹⁷,
Jia Zhu¹⁷,
Yong Tang¹⁷,
Fen Yang¹⁷,
Bo Peng¹⁷ &
…
Weiling Li¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10179))

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

1786 Accesses
6 Citations

Abstract

In digital libraries, ambiguous author names may occur because of the existence of multiple authors with the same name or different name variations for the same person. In recent years, name disambiguation has become a major challenge when integrating data from multiple sources in bibliographic digital libraries. Most of the previous works solve this issue by using many attributes, such as coauthors, title of articles/publications, topics of articles, and years of publications. However, in most cases, we can only get the coauthor and title attributes. In this paper, we propose an approach which is based on Hierarchical Agglomerative Clustering (HAC) and only use the coauthor and title attributes, but can more effectively identify the disambiguation authors. The whole algorithm can divide into two stages. In the first stage, we employ a pair-wise grouping algorithm which is based on coauthors’name to group records into clusters. Then, we merge two clusters if the similarity of the article titles from two clusters reach the threshold. Here, we use three kinds of similarity algorithms such as Jaccard Similarity, Cosine Similarity and Euclidean Distance to compare the similarity between the titles of two clusters. To minimize the risk of using only one similarity metric, we design the concept of ranking confidence to measure the confidence of different similarity meausrements. The ranking confidence decides which similarity measure to use when merging clusters. In the experiments, we use PairPresicion, PairRecall and PairF1 score to evaluate our method and compare with other methods. Experimental results indicate that our method significantly outperforms the baseline methods: HAC, K-means and SACluster when only use coauthor and title attributes.

Access provided by CONRICYT-eBooks. Download conference paper PDF

Author Name Disambiguation by Exploiting Graph Structural Clustering and Hybrid Similarity

Article 16 February 2018

Author Name Disambiguation Based on Rule and Graph Model

Large Scale Name Disambiguation Using Rule-Based Post Processing Combined with Aminer

Keywords

1 Introduction

Name disambiguation in digital libraries refers to the task of attributing the publications or citation records to the proper authors [15, 24]. It is common that several authors share the same name or a single author has multiple names in digital library. This will make people feel inconvenient because it is difficult for them to identify which records belong to the proper author. For example, Han et al. [7] found that the author page of “Yu Chen” contains citations authored by three individuals with the same name. Zhu et al. [25] also found that there were at least 50 different authors who called “Wei Wang” in DBLP and there were more than 400 entries under this name. In the other hand, different authors may share the same name label in multiple papers, e.g., both “Jia Zhu” and “Jiaxing Zhu” are used as “J. Zhu” in their papers. In recent years, name disambiguation has become a major challenge when integrating data from multiple sources in bibliographic digital libraries.

At present, disambiguation of homonymous names has received a growing attention with the advent of the semantic web and social networks. Successful name entity disambiguation may greatly help in locating the right researcher and obtaining his/her academic information from the correct homepage, and indexing bibliographic database more accurately and efficiently. In our paper, we think that the role of name disambiguation is that the application of Search Engines, Social Networks, Credit Evaluation, Conflict of Interest and so on. The Table 1 lists the five records with the authors’ name and the paper title. It is difficulty for us to make sure the author “Wen Gao” is the same person. Beyond the problem of sharing the same name among different people, name abbreviations and other reference variations compound the challenge of name disambiguation. This paper is to solve this problem of sharing the same name among different people through the coauthors and paper title.

Table 1. An example of name disambiguation

Full size table

There are many existing approaches to solve the problem of name ambiguity in the former research. For example, Yang et al. [19] proposed two kinds of correlations between citations, namely, Topic Correlation and Web Correclation, to exploit relationships between records. However, they only considered the Web pages edited by humans to measure Web correlation without using other existent digital libraries. Tang et al. [18] formalized the problems in a unified framework and proposed a generalized probabilistic model to solve this problem. In the difinition section, they assigned six attributes to each paper. Unfortunately, those may not always be practical for all digital libraries because we can not get all information directly.

In the disambiguation features selection, previous works have employed coauthors, title of articles/publications, topics of articles, and years of publications that constitute basic citation data. Title of articles may epitomize research areas of their authors, thus under the assumption that namesakes do not heavily share their research areas, title similarity between articles may be employed in resolving homonymous author names. The same also applies to the case of title of books. In addition to the above record-internal features, some have utilized record-external features such as abstracts, self-records, and citations URLs. When the full text of articles is available, additional features such as e-mail addresses, affiliation, and keywords can be extracted and applied to the author name disambiguation, but it is not available in many digital libraries. In this study, we choose the coauthors and title of paper as the disambiguation features.

In this paper, to reduce errors in the process of hierarchical agglomerative clustering [17, 21] on a list of citation records, we design the concept of ranking confidence to measure the confidence of the different similarity measurements. The ranking confidence can decides which similarity measures to use when merging clusters. We employ three similarity algorithm such as Jaccard Similarity [12, 14], Cosine Similarity [12, 13] and Euclidean Distance^{Footnote 1} to contrast the similarity between two clusters. In the process of clustering, we adapt a pair-wise grouping algorithm to group records into clusters, which repeatly merges the most similar pairs of clusters. In the experiments, we use PairPresicion, PairRecall and PairF1 score to evalute our method results and compare with other methods.

The rest of this paper is organized as follows. In Sect. 2, we discuss related work in name disambiguation. In Sect. 3, we decsribe details of our approach including String Similarity, Ranking Confidence and clustering procedure. In Sect. 4, we describe our experiments, evaluation methods and result analysis, and compare our approach with other methods. We also conclude and discuss this study in Sect. 5.

2 Related Work

This section will discuss recent works and prior works for name disambiguation. A great deal of research has focused on the name disambiguation problem in different types of data sets and adapted unsupervised learning approaches or semi-supervised learning methods or supervised learning measures and so on. If the name disambiguation results can be better, it is useful for evaluating faculty publications and calculating statistics of social network and anthors impacts.

There are many existing methods for name disambiguation problems. Kang et al. [8] explored the net effects of co-authorship on author clustering in bibliographic data. They proposed a web-assisted technique of acquiring implicit coauthors of the target author to be disambiguated and considered that the identity of an author can be determined by his/her coauthors. However, Korean was one of best suitable languages for this study and their work can not identity which pages are personal pages. Han et al. [7] proposed an unsupervised learning approach using K-way spectral clustering that disambiguates authors in citations. However, this clustering method may not work well when there are many ambiguous athors in the dataset and it is unsuitable for large-scale digital libraries since K is not known a priori for an ever increasing digital library.

Lei et al. [3] proposed new research for entity disambiguation with the focus of name disambiguation in digital libraries, they adapted the pairwise similarity and a novel Hierarchical Agglomerative Clustering approach with Adaptive Stopping Criterion. Zhu et al. [23] proposed an approach that can effectively identified and retrieved information from web pages and used the information to disambiguate authors. Initially, they also implement a web pages identification model by using a neural network classifier and traffic rank. However, the information of web pages are disordered and it is difficulty to extract effective information, especially when the information of author are lacking. Mann et al. [11] presented a set of algorithms for distinguishing personal names with multiple real referents in text, based on little or no supervision. The approach utilized an unsupervised clustering technique over a rich feature space of biographic facts.

Lin et al. [16] proposed a supervised method for exploiting all side information including co-author, organization, paper citation, title similarity, author’s homepage, web constraint and user feedbacks. Although user feedbacks can increase more useful information, but when the amount of data is very large, the user feedbacks information are very difficult to collect and also expend much manpower and material resources in the process of collecting. Li et al. [9] presented a novel categorical set similarity measure named CSLR for two sets which both follow categorical distributions. It is applied in Author Name Disambiguation to measure the similarity between two venue sets or coauthor sets. However, there is only one kind of similarity which can not be persuasive for the different emphasis on the similarity.

Yang et al. [19] proposed two kinds of correlations between citations, namely, Topic Correlation and Web Correlation, to exploit relationships between citations, in order to identify whether two citations with the same author name refer to the same individual. They only extracted citations’ topics from venue information and discover the topic-based relationships and which may not be typical. Yin et al. [20] developed a general object distinction methodology called DISTINCT, which combined two complementary measures for relational similarity: set resemblance of neighbor tuples and random walk probability. Although the accuracy of this method is high but it relies heavily on the quality of the training data, which is difficult to obtain. On the other hand, the name of the Proceedings and Conferences also has a certain ambiguity, which may affect the calculation of the similarity.

In the attribute selection of data sets, many academic papers choose a variety of attributes. For example, Tang et al. [18] assigned six attributes to each paper, such as the title of the paper, the conference published of paper, the time published of paper, the abstract of paper, the authors of paper and the reference of paper. Han et al. [7] used three types of attributes to design features for name disambiguation: coauthor names, paper titles, and publication venue titles. Han et al. [5] preprocessed the datasets on author names, paper title words and journal title words as follows. Arif et al. [1] chose six attributes: the title of citation, authors of citation, e-mail of author, affiliations of publication and year of publication. In this paper, we only choose two attributes: coauthors and the title of paper for analysis.

3 Proposed Approach

3.1 String Similarity Metrics

In this section, we will use three methods such as Jaccard similarity, Cosine Similarity and Euclidean Distance to calculate the similarity of two paper title, then we adapt the confidence of ranking to determine which similarity values should be used. Our approach based on Hierarchical Agglomerative Clustering (HAC) that can effectively identify the disambiguation authors.

3.1.1 Jaccard Similarity

Among many possible token-based similarity measures, we use Jaccard similarity [12, 14] to calculate the similarity of two paper title. Jaccard similarity coefficient is an index to measure the similarity of two sets. The smaller the value, the more similar between the two titles. We briefly describe the metric below. Using the terms of Table 2, Jaccard coefficient similarity function can be defined as follows:

$$\begin{aligned} Jaccard = \frac{|T_{x}\bigcap T_{y}|}{|T_{x}\bigcup T_{y}|} \end{aligned}$$

(1)

Table 2. Terms in the set of paper title

Full size table

3.1.2 Cosine Similarity

In this approach, we use vector similarities to measure the similarity of the paper title. The cosine similarity [12, 13] between two vectors (or two title of paper on the Vector Space) is a measure that calculates the cosine of the angle between them. This metric is a measurement of orientation and not magnitude, it can be seen as a comparison between title of paper on a normalized space because we’re not taking into the consideration only the magnitude of each word count (TF-IDF) of each title of paper (we have removed the stop words), but the angle between the paper titles of records.

For example, suppose we have two paper title of citations, one is ‘Why Does Unsupervised Pre-training Help Deep Learning’, another is ‘Deep Learning in Neural Networks: An Overview’, all the words of them are why, does, Unsupervised Pre-training, help, Deep Learning, in, Neural Networks, an, overview, then we remove the stop words, so the all words of them are Unsupervised Pre-training, help, Deep Learning, Neural Networks, overview. Fially, we have a vector $V(title)_{1} = [1, 1, 1, 0, 0]$ and another vector $V(title)_{2} = [0, 0, 1, 1, 1]$. We use the simple cosine similarity, an angle between two vectors, defined as:

$$\begin{aligned} cos\theta = \frac{\overrightarrow{V(title)_{1}} \times \overrightarrow{V(title)_{2}}}{||V(title)_{1}|| \times ||V(title)_{2}||} \end{aligned}$$

(2)

3.1.3 Euclidean Distance

The Euclidean Distance^{Footnote 2} between points x and y is the length of the line segment connecting them (${\overline{{\mathbf {x}} {\mathbf {y}} }}$). We first calculate the Euclidean distance, and then we can get the similarity between the two titles of paper. The farther the distance between them, the greater the difference among the titles. One of the most prominent instantiations of the function is as follows.

$$\begin{aligned} dist(X,Y) = \sqrt{\sum _{i=1}^n (x_{i}-y_{i})^2} \end{aligned}$$

(3)

3.2 Ranking Confidence

To minimize the risk of using only one similarity metric, we use the ranking of confidence to determine which metric should used for clustering. For example, there is one cluster R which includes three citation records $r_{1}$, $r_{2}$, $r_{3}$, if we want to know whether a citation record q can be merged into R, we should calculate the similarity between q and each record in R. Suppose that the similarity between $r_{1}$ and q is ranked lower than all other records using Jaccard Similarity, but it is ranked higher than other citation records using Cosine Similarity, we do not know the exact ranking should be in this case. Therefore, we adapt the confidence of ranking to determine which similarity values should be used.

After calculating three similarity between two title of the paper respectively, we design an algorithm to calculate the ranking of confidence. As we know, the confidence of three similarity in rankings should be highly related to the distribution of similarity values [2]. The functions are defined as follows Eqs. (4), (5), (6):

$$\begin{aligned} d_{(r,q,f,k)} = {|sim_{f}(r,q,k)-sim_{f}(r,q,t)|} (k=1, 2, 3\ldots n), \end{aligned}$$

(4)

$$\begin{aligned} C_{(r,q,f,k)} = \frac{1-d_{(r,q,f,k)}}{\sum _{i=1}^n C_{(r,q,f,i)}} (k=1, 2, 3\ldots n), \end{aligned}$$

(5)

$$\begin{aligned} C_{(r,q,k)final} = average{(\sum _{i=1}^n C_{(r,q,f,k)})}. \end{aligned}$$

(6)

Where $C_{(r,q,f,k)}$ is the confidence of similarity method f in the ranking of cluster r. We model $C_{(r,q,f,k)}$ as the probability of cluster r is ranked at position k in the final ranking given that its similarity value is sim$_{f}(r,q,k)$, the sim$_{f}(r,q,k)$ is the similarity value of the method f. The $d_{(r,q,f,k)}$ tries to compute the difference between $sim_{f}(r,q,k)$ and $sim_{f}(r,q,t)$, and uses it to model the required confidence value, the $sim_{f}(r,q,t)$ indicates the top value of the $sim_{f}(r,q,k)$ $(\text {k}\,=\,1, 2, 3 \ldots \,\text {n})$. In Eq. (6), $C_{(r,q,k)final}$ is the final value of confidence through calculating average.

The Fig. 1 shows the process of calculating the value of confidence. In this process, we want to calculate the confidence value between a citation record and the every citation record of cluster1, the S(1, J) and C(1, J) stands for the value of similarity using the Jaccard Simalarity and the value of confidence between the title of citation between the title1 of cluster1 respectively, the S(1, C) and and C(1, C) indicates the value of similarity using the Cosine Similarity and the value of confidence between the title of citation between the title1 of cluster1 respectively, the S(1, E) and C(1, E) represents the value of similarity using the Euclidean Distance and the value of confidence between the title of citation between the title1 of cluster1 respectively. The value of confidence C(1, J), C(1, C) and C(1, E) are calculated by Eqs. 4, 5. Finally, we get the final value $C_J\_final$, $C_C\_final$ and $C_E\_final$ of confidence through Eq. 6.

For example, suppose that we hava a citation record Q, and a cluster R includes three citation records $R_{1}$, $R_{2}$, $R_{3}$, we first calculate the three similarity value by Jaccard similarity method between Q and $R_{1}$, $R_{2}$, $R_{3}$, the value is $S_{1}$, $S_{2}$, $S_{3}$, then we can get three corresponding confidence values $C_1$, $C_2$, $C_3$ by Eqs. (4), (5). Therefore, by Eq. (6), we can get the result $C_J\_final$ as the final value of Jaccard similarity to compare with other similarity value of cluster1. According to the confidence of ranking, suppose that the $C_J\_final$ value is the highest between $C_J\_final$, $C_C\_final$ and $C_E\_final$, then we use the Jaccard simalrity value as the final value of cluster1 to compare with other clusters’ similarity value.

3.3 Clustering Procedure

In the paper, we employ Hierarchical Agglomerative Clustering (HAC) [17, 21] as the basic framework. It starts with each paper being a cluster, we first cluster the citations on HAC, then we find the most similar (the used similarity measuers will be definded later) pairs of clusters, and merge them, until the maximal similarity falls below certain threshold. In this process, we choose two kinds of attributes to cluster: coauthors and paper title. The whole clustering process divides into two stages:

1.
Merge based on the evidence from shared coauthors based on Hierarchical Agglomerative Clustering;
2.
Merge based on the combined similarity defined on the title sets of each pair of clusters.

The reasons for developing the two-stage clustering are two points: Firstly, coauthors often provide stronger evidence than other features, based on which the generated cluster usually comprises of papers of the same author, but the papers of an author may distribute among multiple clusters [4]; Secondly, the paper title feature is weak evidence, based on which we can furter merge clusters from the same author. More importantly, we adapt the Ranking Confidence method to decide whether one cluster can merge into another cluster.

Our modified agglomerative clustering method is shown in Algorithm 1. This approach based on Hierarchical Agglomerative Clustering (HAC) that can effectively identify the disambiguation authors, the whole clustering process divides into two stages. In the first stage, we employ a pair-wise groupting algorithm to group records into clusters, which based on coauthors’ name. Then, we use three similarity algorithm such as Jaccard Similarity, Cosine Similarity and Euclidean Distance to contrast the similarity between two clusters. To minimize the risk of using only one similarity metric, we use the ranking of confidence to determine which metric should used for clustering. When we get the final metric, we will check if two of three citation records are greater than the threshold we set for each similarity metric. If two of three citation records are greater than the threshold, we will merge the citation record into the cluster, otherwise we will stop the clustering process. This threshold is adjusted during the experiment.

4 Experiments

4.1 Data Sets

In our experiments, we perform evaluations on a dataset constructed by Tang et al. [18], which contain the citations collected from the DBLP Website. We downloaded this dataset from the Website^{Footnote 3}. Each citation record consists of three basic attributes: title, coauthors, and venue, but we only choose the two attributes of those three basic attributes: title of paper and coauthors to our experiments. In this paper, we collected 110 author citation records which includes 1723 individual authors and 8505 citation records. Some statistics of this data set are shown in Table 3. For example, there are 28 persons with the name “David Brown” and 40 persons named “Lei Chen”.

Table 3. Evaluation dataset

Full size table

4.2 Evaluation Results

As in [18, 20], we use Pairewise Precision, Pairwise Recall, and Pairwise F1 score to evaluate the performance of our method and to compare with previous methods. The pairwise measures are adapted for evaluating name disambiguation by considering the number of pairs of papers assigned with the same label. For example, if author one and author two whose names are the same in the two papers, we think that those two papers are the same as the author one and is one pair with the same label. Specifically, any two papers that are annotated with the same label in the ground truth are called a correct pair, and any two papers are predicted with the same label. For two papers with the same label predicted by an approach, but not have the same label in the ground truth, we call it a wrong pair. We note the counting is for pairs of papers with the same label (either predicted or labeled) only. Thus, we can define the measures as follows Eqs. 7, 8, 9:

$$\begin{aligned} PairPrecision = \frac{\#PairsCorrectlyPredictedToSameAuthor}{\#TotalPairsPredictedToSameAuthor} \end{aligned}$$

(7)

$$\begin{aligned} PairRecall = \frac{\#PairsCorrectlyPredictedToSameAuthor}{\#TotalPairsToSameAuthor} \end{aligned}$$

(8)

$$\begin{aligned} PairF1 = \frac{2 \times PairPrecision \times PairRecall}{PairPrecision + PairRecall} \end{aligned}$$

(9)

In this paper, we compare several baseline methods, such as Hierarchical Agglomerative Clustering (HAC) [17, 21], K-means [10], SACluster [22] to evaluate our approach. HAC only used the Jaccard Similarity to measure the similarity between the citations and based on a list of citations and utilized a search engine to help the disambiguation task. K-means is a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process appears to give partitions which are reasonably efficient in the sense of within-class variance [10], In K-means algorithm, each citation is represented by a feature vector, with each coauthor name and each keyword of the paper title as a feature of the vector [6]. SACluster tries to partition the nodes in a graph into K clusters by using both structural and attributes information associated to each node [18].

For fair comparison, we try to input the same attribute features defined in our method in these methods. Our approach consider the baseline method Hierarchical Agglomerative Clustering (HAC) and Ranking Confidence to help the disambiguation task, with the feature coauthors and title of paper. We conducted disambiguation experiments for papers related to each of the author names in our data set. Table 4 shows the results of some examples in our data sets. It is clearly indicated that our approach outperforms the baseline methords for name disambiguation, (+13.15% over HAC, +46.62% over K-means, +20.18% over SACluster by average F1 score). On the other hand, our approach have much higher precision compared to other methods excepts HAC and higher recall than other three methods. As we can see, the recall and F1 of HAC with ranking confidence (our proposed method) higher than the HAC without ranking confidence. For example, the results of “Sanjay Jain” and “Qiang shen” demonstrate that our approach have much higher precision, recall and F1 than the method HAC. Obviously, the author “Charles Smith”’s result indicates that our approach much better than the K-means and the result of the author “Thomas D. Taylor” suggests that our approach have much better than the method K-means and SACluster.

Table 4. Results of name disambiguation (Percent)

Full size table

5 Conclusion and Discussion

Name disambiguation in databases is a non-trivial task because different person can share the same name and one person can have many name variations. What’s more, in most case only limited information is associated with each name in the database. This paper describes a clustering approach for name disambiguation in DBLP, which only use the coauthor and paper title information of each person. Firstly, we group records of the same name into different clusters according to coauthors. Then, merging two clusters if the similarity of their titles reach the threshold. To reduce chance of selecting similarity algorithm, we propose an algorithm called Ranking Confidence. In the experiments, we use PairPresicion, PairRecall and PairF1 score to evaluate our method and compare with other methods. Experiment results show that the approach efficiently differentiate authors with the same name and generate better results than baseline methods: HAC, K-means, SACluster when using only two attributes: coauthors and title of paper. If the name disambiguation results can be better, it is useful for evaluating faculty publications and calculating statistics of social network and anthors impacts. In the future, we will pay more attention to the merge method of different clusters.

Notes

References

Arif, T., Ali, R., Asger, M.: Author name disambiguation using vector space model and hybrid similarity measures. In: International Conference on Contemporary Computing-IC, pp. 135–140 (2014)
Google Scholar
Bishop, T.A., Dudewicz, E.J.: Complete ranking of reliability-related distributions. IEEE Trans. Reliab. R–26(5), 362–365 (1977)
Article MATH Google Scholar
Cen, L., Dragut, E.C., Si, L., Ouzzani, M.: Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion. In: International ACM SIGIR Conference on Research and Development in Information Retrieval (2013)
Google Scholar
Cota, R.G., Ferreira, A.A., Nascimento, C., Goncalves, M.A., Laender, A.H.F.: An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. J. Am. Soc. Inf. Sci. Technol. 61(9), 1853–1870 (2010)
Article Google Scholar
Han, H., Giles, L., Zha, H., Li, C., Tsioutsiouliklis, K.: Two supervised learning approaches for name disambiguation in author citations. In: Proceedings of the Joint ACM/IEEE Conference on Digital Libraries, pp. 296–305 (2004)
Google Scholar
Han, H., Zha, H., Giles, C.L.: A model-based k-means algorithm for name disambiguation. In: International Semantic Web Conference (2003)
Google Scholar
Han, H., Zha, H., Giles, C.L.: Name disambiguation spectral in author citations using a k-way clustering method. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, JCDL, Denver, CO, USA, 7–11 June, pp. 334–343 (2005)
Google Scholar
Kang, I.S., Na, S.H., Lee, S., Jung, H., Kim, P., Sung, W.K., Lee, J.H.: On co-authorship for author disambiguation. Inf. Process. Manag. 45(1), 84–97 (2009)
Article Google Scholar
Li, S., Cong, G., Miao, C.: Author name disambiguation using a new categorical distribution similarity. In: Flach, P.A., Bie, T., Cristianini, N. (eds.) ECML PKDD 2012. LNCS (LNAI), vol. 7523, pp. 569–584. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33460-3_42
Chapter Google Scholar
Macqueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
Google Scholar
Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation, pp. 33–40 (2004)
Google Scholar
Nadimi, M.H., Mosakhani, M.: A more accurate clustering method by using co-author social networks for author name disambiguation. J. Comput. Secur. 1, 307–317 (2015)
Google Scholar
On, B.W.: Social network analysis on name disambiguation and more. In: International Conference on Convergence and Hybrid Information Technology, pp. 1081–1088 (2008)
Google Scholar
On, B.W., Lee, I.: Meta similarity. Appl. Intell. 35(3), 359–374 (2011)
Article Google Scholar
Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I.: Identity uncertainty and citation matching. In: NIPS, pp. 1425–1432 (2003)
Google Scholar
Quan, L., Bo, W., Yuan, D.U., Wang, X., Yuhua, L.I.: Disambiguating authors by pairwise classification. Tsinghua Sci. Technol. 15(6), 668–677 (2010)
Article Google Scholar
Tan, Y.F., Kan, M.Y., Lee, D.: Search engine driven author disambiguation. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, JCDL, Chapel Hill, NC, USA, 11–15, June, pp. 314–315 (2006)
Google Scholar
Tang, J., Fong, A.C.M., Wang, B., Zhang, J.: A unified probabilistic framework for name disambiguation in digital library. IEEE Trans. Knowl. Data Eng. 24(6), 975–987 (2011)
Article Google Scholar
Yang, K.-H., Peng, H.-T., Jiang, J.-Y., Lee, H.-M., Ho, J.-M.: Author name disambiguation for citations using topic and web correlation. In: Christensen-Dalsgaard, B., Castelli, D., Ammitzbøll Jurik, B., Lippincott, J. (eds.) ECDL 2008. LNCS, vol. 5173, pp. 185–196. Springer, Heidelberg (2008). doi:10.1007/978-3-540-87599-4_19
Chapter Google Scholar
Yin, X., Han, J., Yu, P.S.: Object distinction: distinguishing objects with identical names. In: International Conference on Data Engineering, ICDE, The Marmara Hotel, Istanbul, Turkey, April, pp. 1242–1246 (2007)
Google Scholar
Zepeda-Mendoza, M.L., Resendis-Antonio, O.: Hierarchical agglomerative clustering. Encycl. Syst. Biol. 43(1), 886–887 (2013)
Article Google Scholar
Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute similarities. Proc. VLDB Endow. 2(1), 718–729 (2009)
Article Google Scholar
Zhu, J., Fung, G., Wang, L.: Efficient name disambiguation in digital libraries. In: Wang, H., Li, S., Oyama, S., Hu, X., Qian, T. (eds.) WAIM 2011. LNCS, vol. 6897, pp. 430–441. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23535-1_37
Chapter Google Scholar
Zhu, J., Cheong Fung, G.P., Zhou, X.: Anddy: a system for author name disambiguation in digital library. In: Kitagawa, H., Ishikawa, Y., Li, Q., Watanabe, C. (eds.) DASFAA 2010. LNCS, vol. 5982, pp. 444–447. Springer, Heidelberg (2010). doi:10.1007/978-3-642-12098-5_46
Chapter Google Scholar
Zhu, J., Zhou, X., Fung, G.P.C.: A term-based driven clustering approach for name disambiguation. In: Li, Q., Feng, L., Pei, J., Wang, S.X., Zhou, X., Zhu, Q.-M. (eds.) APWeb/WAIM -2009. LNCS, vol. 5446, pp. 320–331. Springer, Heidelberg (2009). doi:10.1007/978-3-642-00672-2_29
Chapter Google Scholar

Download references

Acknowledgement

This work was supported by the Natural Science Foundation of Guangdong Province, China (2015A030310509), the Public Research and Capacity Building in Guangdong Province, China (2016A030303055), the Major Science and Technology projects of Guangdong Province, China (2016B030305004, 2016B010109008, 2016B010124008) and the National Natural Science Foundation of China (61272067).

Author information

Authors and Affiliations

School of Computer Science, South China Normal University, Guangzhou, China
Xueqin Lin, Jia Zhu, Yong Tang, Fen Yang, Bo Peng & Weiling Li

Authors

Xueqin Lin
View author publications
You can also search for this author in PubMed Google Scholar
Jia Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Yong Tang
View author publications
You can also search for this author in PubMed Google Scholar
Fen Yang
View author publications
You can also search for this author in PubMed Google Scholar
Bo Peng
View author publications
You can also search for this author in PubMed Google Scholar
Weiling Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yong Tang .

Editor information

Editors and Affiliations

Royal Melbourne Institute of Technology , Melbourne, Australia
Zhifeng Bao
Northwestern University , Evanston, Illinois, USA
Goce Trajcevski
University of New South Wales , Sydney, New South Wales, Australia
Lijun Chang
The University of Queensland , Brisbane, Queensland, Australia
Wen Hua

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Lin, X., Zhu, J., Tang, Y., Yang, F., Peng, B., Li, W. (2017). A Novel Approach for Author Name Disambiguation Using Ranking Confidence. In: Bao, Z., Trajcevski, G., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science(), vol 10179. Springer, Cham. https://doi.org/10.1007/978-3-319-55705-2_13

Download citation

DOI: https://doi.org/10.1007/978-3-319-55705-2_13
Published: 22 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55704-5
Online ISBN: 978-3-319-55705-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

A Novel Approach for Author Name Disambiguation Using Ranking Confidence

Abstract

Similar content being viewed by others

Author Name Disambiguation by Exploiting Graph Structural Clustering and Hybrid Similarity

Author Name Disambiguation Based on Rule and Graph Model

Large Scale Name Disambiguation Using Rule-Based Post Processing Combined with Aminer

Keywords

1 Introduction

2 Related Work