
1 Introduction

At present, there are several literature retrieval platforms in the world, such as China National Knowledge Infrastructure (CNKI), DBLP, CiteSeer, and PubMed. The content and quality of a digital library are seriously affected by the ambiguity of author names, which is regarded as one of the most difficult issues facing digital libraries [1]. Therefore, how to reduce the impact of name ambiguity and maximize the effectiveness of the digital library has become a concern of researchers. The problem of “Name Disambiguation” was thus raised and has attracted the attention of a large number of experts and scholars.

Name Disambiguation, also known as Entity Resolution [2, 3] or Name Identification [4], mainly solves the problems of name coreference and name ambiguity. The name coreference problem mainly appears in English digital libraries: it is common for a single author to appear under multiple names. For example, the author name A. Lim may stand for Andrew Lim, Abel Lim, etc. The name ambiguity problem means that different authors may share identical names in the real world. For example, there are 57 papers authored by 2 different authors named “Alok Gupta” in the DBLP database.

A lot of work has been devoted to Name Disambiguation. For example, Shen et al. [5] presented a novel visual analytics system called NameClarifier to interactively disambiguate author names in publications. However, NameClarifier still heavily relies on human beings’ subjective judgments. Kim et al. [6] used Random Forest to derive the distance function and obtained good accuracy, but the training set required a lot of manual labeling, and the model migrates poorly to new data. Lin et al. [7] proposed an approach that uses only the coauthor and title attributes, but they did not consider the coreference problem. Xu et al. [8] observed that each kind of single feature is very fuzzy in its expression and used a similarity algorithm; however, many features fail to match correctly because two identical strings can be written in different formats.

This paper proposes an algorithm called Author Name Disambiguation based on Molecular Cross Clustering (ANDMC), which takes name coreference into account. Meanwhile, we propose a string matching algorithm called Improved Levenshtein Distance (ILD), which solves the problem of matching two identical strings written in different formats. The experimental results show that our algorithm outperforms the baseline methods (F1 score 21.45% higher than SC and 9.48% higher than HAC).

The structure of this paper is as follows. In Sect. 2, we introduce related research on name disambiguation. In Sect. 3, we present the core of this article, including the similarity calculation method for author name disambiguation and the merging procedure. In Sect. 4, we describe our experiments and verify the proposed method. In Sect. 5, we summarize the proposed method, address its shortcomings, and discuss ideas for future improvement.

2 Related Work

The problem of name ambiguity often appears in literature retrieval platforms, digital libraries and other similar systems, and has become one of the main problems in the fields of information retrieval, data mining and scientometrics [9]. “Name Disambiguation”, which mainly solves the problems of name coreference and name ambiguity, was therefore proposed.

The name coreference problem mainly appears in English digital libraries. Newman et al. [10] proposed a heuristic method that completely matches the last name together with the first letter of the first name, but under this scheme names with the same spelling that belong to different authors, such as “M. Li”, “Min Li” and “Ming Li”, are merged, which reduces accuracy.

The name ambiguity problem means that different authors may share identical names in the real world. In general, existing methods for name disambiguation fall into three categories: supervised [11, 12], semi-supervised [13] and unsupervised [14,15,16,17]. Supervised methods achieve high accuracy, but training on massive data requires a lot of manual labeling, which is time-consuming and labor-intensive. What is more, data iterates rapidly over time, so supervised methods have poor portability. Semi-supervised methods use user feedback to obtain more useful information, but when the amount of data is very large, user feedback is very difficult to collect and also expends much manpower and material resources [7]. The biggest advantage of unsupervised methods is that they do not require a lot of training data or training time. On large-scale data, no method is more feasible and scalable than the unsupervised approach.

The performance of an unsupervised method is determined not only by the clustering algorithm but also by the similarity calculation. For the problem of name ambiguity, the selection of features and how these features are used to calculate similarity are as important as the choice of clustering algorithm. Shin et al. [18], Fan et al. [19] and Kang et al. [20] selected coauthor relationships as features, but authors who have no coauthors cannot be distinguished. Lin et al. [7] proposed an approach that uses only the coauthor and title attributes, but they did not consider the coreference problem. Xu et al. [8] observed that each kind of single feature is very fuzzy in its expression and used a similarity algorithm; however, many features fail to match correctly because two identical strings can be written in different formats.

Based on the previous research results, this paper further studies the Name Disambiguation. The main contributions of this paper can be summarized as follows:

  1. Propose a string matching algorithm called Improved Levenshtein Distance (ILD), which solves the problem of matching two identical strings written in different formats (F1-score 13.08% higher than LD).

  2. Propose an algorithm called Author Name Disambiguation based on Molecular Cross Clustering (ANDMC) that takes name coreference into account (F1-score 21.45% higher than SC, 9.48% higher than HAC).

3 Proposed Approach

This paper proposes a molecular cross clustering method. Figure 1 shows the process of molecular cross clustering. We regard each paper as an atom. Firstly, the papers are classified according to the author’s name, keeping the associated category records, and atom clustering [21] is performed within each category to form molecules. Then the similarity between molecules is calculated according to the associated category records, differentiated by the standard segmentation feature values, to obtain the final classification result. Each step extracts features from the result of the previous merge, which effectively increases the amount of data for the corresponding feature and improves the accuracy of the merge.

Fig. 1. The process of molecular cross clustering.

Table 1 lists five records, each containing the authors, title, and affiliation of a paper. It is difficult to be sure that the author “Andrew Lim” is the same person across records. According to our algorithm, we first divide the papers into two major categories: Andrew Lim {{1}, {2}, {4}, {5}} and A. Lim {3}. Second, it is difficult to directly judge whether the Andrew Lim of records 1 and 2 is the same person, but records 1 and 4 share the collaborator Zhou Xu. After merging 1 and 4 we find the collaborator Hu Qin, who also appears on record 2, which means there is a higher probability that 1 and 2 are the same person. In the same way, we can easily obtain the set {1, 2, 4, 5}. At this point, calculating the similarity between the sets {1, 2, 4, 5} and {3}, we find that they share the collaborator “Fan Wang”, the same institution, similar titles, etc.
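The merging logic of this example can be sketched as follows (a toy illustration, not the paper’s implementation; the record numbers follow Table 1 but the coauthor lists are hypothetical stand-ins consistent with the text):

```python
# Records 1, 2, 4, 5 carry the name "Andrew Lim"; each maps to its coauthor
# set. Groups are merged whenever they share at least one collaborator.
records = {
    1: {"Zhou Xu"},
    2: {"Hu Qin"},
    4: {"Zhou Xu", "Hu Qin"},
    5: {"Fan Wang", "Hu Qin"},
}

# Start with one group per record, then merge greedily until stable.
groups = [({rid}, set(coauthors)) for rid, coauthors in records.items()]
merged = True
while merged:
    merged = False
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            ids_i, co_i = groups[i]
            ids_j, co_j = groups[j]
            if co_i & co_j:  # shared collaborator -> likely the same person
                groups[i] = (ids_i | ids_j, co_i | co_j)
                del groups[j]
                merged = True
                break
        if merged:
            break

print(sorted(sorted(ids) for ids, _ in groups))  # -> [[1, 2, 4, 5]]
```

Records 1 and 4 merge first (shared Zhou Xu); the merged group then picks up 2 and 5 through Hu Qin, reproducing the chain of merges described above.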

The steps of the algorithm for Author Name Disambiguation based on Molecular Cross Clustering are as follows:

  1. Data processing

  2. Solve the problem of name ambiguity

     (a) Node relationship division

     (b) Affiliation string matching

  3. Solve the problem of name coreference

     (a) Similar name cross match

Table 1. An example of name disambiguation.

3.1 Data Processing

Perform pre-processing operations such as integration, cleaning and de-duplication on the data to obtain the initial data. Each paper in the initial data is treated as an atom.

Extract the following feature attributes for each paper P:

$$\begin{aligned} P = (A,\quad T,\quad I) \end{aligned}$$
(1)

Where A represents the author of the paper, T represents the title of the paper, and I represents the affiliation of the paper.

We treat each paper as a node. Let n be a name entity, and let its variants be denoted Vn = {V1, V2, ..., Vm}, where a variant of n may be an abbreviated form, a rotation of last name and first name, a change of connecting symbol, or a combination of these [22]. The set of papers corresponding to a name variant Vx is denoted Pn = {p1, p2, ..., pk}, where pi is the set of all papers containing the author name Vx. Ai = {a1, a2, ..., ak} denotes the author set corresponding to the paper set pi, and Ni = {n1, n2, ..., nk} denotes the set of same-name authors corresponding to Ai.
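The paper representation P = (A, T, I) and the grouping of papers by name variant can be sketched as follows (field and function names are illustrative, not from the paper):

```python
# A minimal sketch of P = (A, T, I) and the sets P_n grouped by name variant.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Paper:
    authors: tuple[str, ...]  # A: the authors of the paper
    title: str                # T: the title of the paper
    affiliation: str          # I: the affiliation of the paper

def group_by_variant(papers, variants):
    """Map each name variant V_x to the list of papers listing it."""
    groups = defaultdict(list)
    for p in papers:
        for v in variants:
            if v in p.authors:
                groups[v].append(p)
    return groups

papers = [
    Paper(("Andrew Lim", "Zhou Xu"), "Title one", "HKUST"),
    Paper(("A. Lim", "Fan Wang"), "Title two", "HKUST"),
]
groups = group_by_variant(papers, {"Andrew Lim", "A. Lim"})
```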

3.2 Node Relation Division (NRD)

In the research of the name disambiguation, the relationship of cooperation between nodes has a strong influence on the correct division of nodes [20]. For two nodes with the same name attribute, if they all have a cooperative relationship with another node, the two nodes have greater similarity.

The set of collaborators of the name Ni can be denoted as:

$$\begin{aligned} C_i = A_i-N_i = \{a_1-n_1,a_2-n_2,\ldots ,a_k-n_k\}= \{ c_1,c_2,\ldots ,c_k \} \end{aligned}$$
(2)

Traverse the set N\(_i\), treating each n\(_i\) as a node. Traverse the set C\(_i\); each author in the set c\(_i\) generates a node that has a cooperative relationship with the node n\(_i\). We use a graph database to generate the author relationship network and find the number of connections between authors n\(_i\) and n\(_j\), denoted Num(L\(_{ij}\)). Following the Jaccard coefficient similarity function, the similarity between the nodes n\(_i\) and n\(_j\) is:

$$\begin{aligned} sim(n_i,n_j)=\frac{Num(L_{ij})}{|c_i \cup c_j |} \end{aligned}$$
(3)

When the similarity is greater than the threshold value, n\(_i\) and n\(_j\) are merged.
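Equation (3) can be sketched as follows, assuming Num(L\(_{ij}\)) counts the collaborators shared by the two nodes in the coauthor network; the threshold value is hypothetical:

```python
# A sketch of Eq. (3): Jaccard-style similarity between two same-name nodes,
# where the shared collaborators stand in for the connections Num(L_ij).
def node_similarity(coauthors_i: set, coauthors_j: set) -> float:
    union = coauthors_i | coauthors_j
    if not union:
        return 0.0
    shared = coauthors_i & coauthors_j  # connections between n_i and n_j
    return len(shared) / len(union)

c_i = {"Zhou Xu", "Hu Qin"}
c_j = {"Zhou Xu", "Fan Wang", "Hu Qin"}
THRESHOLD = 0.5  # hypothetical merge threshold
print(node_similarity(c_i, c_j))              # ≈ 0.667
print(node_similarity(c_i, c_j) > THRESHOLD)  # -> True: merge the two nodes
```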

3.3 Affiliation String Matching (ASM)

The main difficulty in matching affiliation strings in English databases is that affiliations are written in different formats. For example, consider the four affiliations “IBM India Res. Lab, New Delhi”, “IBM India Research Laboratory”, “IBM India Research Lab, New Delhi, India” and “IBM India Research Lab, New Delhi, India 110 070”. Clearly the four strings refer to the same affiliation, but the different writing formats prevent a computer from matching them correctly. At present, there are many similarity algorithms for string matching, such as the Jaccard algorithm, Euclidean Distance, and Levenshtein Distance. However, calculating similarity over the whole affiliation string is not satisfactory. For example, consider the following two affiliations:

  1. “School of Electrical Engineering & Automation, Henan Polytechnic University, Jiaozuo, People’s Republic of China”

  2. “Department of Electrical Engineering and Automation, Tianjin University, Tianjin, People’s Republic of China”

If we directly calculate the similarity of these affiliation names, they are likely to be judged as the same affiliation, but they are actually different. There is also a problem with the Levenshtein Distance itself. For example, for the words “Research” and “Res”, the Levenshtein distance is 5 and the similarity is only 37.5%, although in reality the two words denote the same word. To solve the problem of matching two identical strings written in different formats while enhancing the accuracy of the similarity calculation, we split the affiliation into individual words and optimize the Levenshtein Distance algorithm into ILD (Improved Levenshtein Distance) to calculate the similarity of each word pair. For the affiliations X and Y, splitting on separators yields the sets X = {x1, x2, ..., xp} and Y = {y1, y2, ..., yq}. Construct the relational matching matrix E with p rows and q columns:

$$\begin{aligned} E_{pq} =\{sim(i,j)\} \end{aligned}$$
(4)

For each x\(_i\) = s\(_1\)s\(_2\)...s\(_m\) and y\(_j\) = t\(_1\)t\(_2\)...t\(_n\), construct the relationship matching matrix LD between x\(_i\) and y\(_j\), with m+1 rows and n+1 columns. The first column of the matrix corresponds to x\(_i\), and the first row to y\(_j\):

$$\begin{aligned} LD = \begin{pmatrix} ld_{00} &{} ld_{01} &{} \cdots &{} ld_{0n}\\ ld_{10} &{} ld_{11} &{} \cdots &{} ld_{1n}\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ ld_{m0} &{} ld_{m1} &{} \cdots &{} ld_{mn} \end{pmatrix} \end{aligned}$$
(5)

Fill the relationship matching matrix LD according to the following formula:

$$\begin{aligned} ld_{ij}= \left\{ \begin{array}{lr} i&{}j=0\\ j&{}i=0\\ min(ld_{i-1\,j-1},ld_{i-1\,j},ld_{i\,j-1}) + 1 \quad &{} i,j>0,\; s_i \ne t_j \\ ld_{i-1\,j-1}&{}i,j>0,\; s_i = t_j \\ \end{array} \right. \end{aligned}$$
(6)

After filling the matrix LD, the element d\(_{mn}\) is the edit distance between x\(_i\) and y\(_j\), which is recorded as:

$$\begin{aligned} d(x_i,y_i)= \left\{ \begin{array}{lr} d_{min(m,n)min(m,n)}\quad &{}x_i \in y_j \;or \; y_j \in x_i\\ d_{mn}&{}else\\ \end{array} \right. \end{aligned}$$
(7)

The similarity sim(xi, yj) is calculated as:

$$\begin{aligned} sim(x_i ,y_j)=1-\frac{d(x_i,y_i)}{max(len(x_i),len(y_j))} \end{aligned}$$
(8)

Where len(x\(_i\)) and len(y\(_j\)) are the lengths of the strings x\(_i\) and y\(_j\), respectively. When sim(x\(_i\), y\(_j\)) = 1, the strings x\(_i\) and y\(_j\) match exactly. For the matrix E\(_{pq}\), if there exists at least one sim(x\(_i\), y\(_j\)) = 1 in a given row or column, we consider that the affiliations X and Y have one exactly matched word, which is recorded as:

$$\begin{aligned} CM(k)= \left\{ \begin{array}{lr} 1\quad &{}\exists \, sim(x,y) = 1\; \text {in line}\; k\\ 0&{}else\\ \end{array} \right. \end{aligned}$$
(9)
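The word-level distance of Eqs. (6)–(8) can be sketched as follows, reading the substring rule of Eq. (7) as falling back to the diagonal entry of the matrix:

```python
# A sketch of the ILD word similarity (Eqs. 6-8): standard Levenshtein
# filling, except that when one word is a substring of the other, the
# diagonal entry ld[min(m,n)][min(m,n)] is used instead of ld[m][n].
def ild_word_similarity(x: str, y: str) -> float:
    m, n = len(x), len(y)
    # Fill the (m+1) x (n+1) matrix LD of Eq. (6).
    ld = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        ld[i][0] = i
    for j in range(n + 1):
        ld[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                ld[i][j] = ld[i - 1][j - 1]
            else:
                ld[i][j] = min(ld[i - 1][j - 1], ld[i - 1][j], ld[i][j - 1]) + 1
    # Eq. (7): substring case falls back to the diagonal entry.
    if x in y or y in x:
        k = min(m, n)
        d = ld[k][k]
    else:
        d = ld[m][n]
    # Eq. (8): normalize by the longer word.
    return 1 - d / max(m, n) if max(m, n) else 1.0

print(ild_word_similarity("Research", "Res"))  # -> 1.0
```

For “Research” vs. “Res” the substring rule yields similarity 1, whereas plain Levenshtein gives 1 − 5/8 = 0.375, which is exactly the mismatch ILD is designed to repair.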

The similarity of the exactly matched words in the affiliations X and Y is:

$$\begin{aligned} sim(X,Y)_{cm}=\frac{average(\sum _{p}CM(p),\sum _{q}CM(q))}{average(p,q)} \end{aligned}$$
(10)

The similarity of the non-exactly matched words in the affiliations X and Y is:

$$\begin{aligned} \begin{aligned} sim(X,Y)_{other}&= \frac{average(\sum _{p}max(sim(X,Y)),\sum _{q}max(sim(X,Y)))}{average(p,q)} \\&-\frac{average(\sum _{p}CM(p),\sum _{q}CM(q))}{average(p,q)} \end{aligned} \end{aligned}$$
(11)

The similarity between the affiliation X and Y is:

$$\begin{aligned} sim(X,Y)= sim(X,Y)_{cm} \times W_1+ sim(X,Y)_{other} \times W_2 \end{aligned}$$
(12)
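The affiliation-level combination of Eqs. (9)–(12) can be sketched as follows; for brevity, a simplified word similarity (equality or prefix match scores 1) stands in for the full ILD of Eqs. (6)–(8), and the weights W1, W2 are hypothetical:

```python
# A sketch of Eqs. (9)-(12): build the relation matrix E, count rows/columns
# containing an exact match (CM), and combine exact and non-exact parts.
def word_sim(a: str, b: str) -> float:
    """Simplified stand-in for ILD: 1.0 on equality or prefix, else overlap."""
    a, b = a.lower(), b.lower()
    if a == b or a.startswith(b) or b.startswith(a):
        return 1.0
    return len(set(a) & set(b)) / max(len(a), len(b))

def tokenize(aff: str):
    return [w for w in (t.strip(".,") for t in aff.split()) if w]

def affiliation_similarity(aff_x: str, aff_y: str, w1=0.7, w2=0.3) -> float:
    xs, ys = tokenize(aff_x), tokenize(aff_y)
    p, q = len(xs), len(ys)
    # Relation matrix E of Eq. (4): pairwise word similarities.
    e = [[word_sim(x, y) for y in ys] for x in xs]
    # CM(k) of Eq. (9): does line k contain an exact match?
    cm_rows = sum(1 for row in e if any(s == 1.0 for s in row))
    cm_cols = sum(1 for j in range(q) if any(e[i][j] == 1.0 for i in range(p)))
    avg_pq = (p + q) / 2
    sim_cm = ((cm_rows + cm_cols) / 2) / avg_pq                 # Eq. (10)
    best_rows = sum(max(row) for row in e)
    best_cols = sum(max(e[i][j] for i in range(p)) for j in range(q))
    sim_other = ((best_rows + best_cols) / 2) / avg_pq - sim_cm  # Eq. (11)
    return sim_cm * w1 + sim_other * w2                          # Eq. (12)

a = "IBM India Research Lab, New Delhi"
b = "IBM India Res. Lab, New Delhi, India"
print(affiliation_similarity(a, b))
```

Note that with this weighting two identical affiliations score W1 (here 0.7), since the non-exact term of Eq. (11) vanishes when every word matches exactly.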

3.4 Similar Name Cross Match (SNCM)

For a name entity n, the name ambiguity problem has already been solved within each variant in Vn = {V1, V2, ..., Vm}. This part mainly solves the problem of name coreference. We need to calculate the similarity between each Ai in Vx, denoted “Vx.Ai”, and each Aj in Vy, denoted “Vy.Aj”. We calculate the corresponding similarities according to the features A, T and I, and set a weight W for each. The similarities between Vx.Ai and Vy.Aj are as follows:

$$\begin{aligned} S(V_x.A_i,V_y.A_j)= \left\{ \begin{array}{lr} S_A=S_{JACCARD}(V_x.A_i,V_y.A_j) \\ S_T=S_{LD}(V_x.T_i,V_y.T_j)\\ S_I=S_{ILD}(V_x.I_i,V_y.I_j)\\ \end{array} \right. \end{aligned}$$
(13)

We place similar name cross matching in the last step because the current similarity calculation does not guarantee 100% accuracy for authors with the same name when duplicate names are numerous. Since each merge is based on the result of the previous step, we must ensure that the accuracy of each earlier merge is as high as possible. Performing this step earlier would greatly affect the accuracy of the subsequent steps.
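The component scores of Eq. (13) can be sketched as follows, with an ordinary normalized Levenshtein similarity standing in for both S\(_{LD}\) and S\(_{ILD}\), and hypothetical weights combining the three features:

```python
# A sketch of Eq. (13): Jaccard over coauthor sets (S_A), and normalized
# edit-distance similarity for titles (S_T) and affiliations (S_I).
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def lev_sim(s: str, t: str) -> float:
    """1 - LevenshteinDistance / max length, via a rolling DP row."""
    m, n = len(s), len(t)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = prev if s[i-1] == t[j-1] else min(prev, d[j], d[j-1]) + 1
            prev = cur
    return 1 - d[n] / max(m, n) if max(m, n) else 1.0

def cross_match_score(rec_x, rec_y, w=(0.4, 0.3, 0.3)) -> float:
    s_a = jaccard(rec_x["coauthors"], rec_y["coauthors"])
    s_t = lev_sim(rec_x["title"], rec_y["title"])
    s_i = lev_sim(rec_x["affiliation"], rec_y["affiliation"])
    return w[0] * s_a + w[1] * s_t + w[2] * s_i
```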

4 Experiments

4.1 Data Sets

In our experiments, we perform evaluations on a dataset constructed by Tang et al. [21], which contains citations collected from the DBLP website; we downloaded this dataset from Kaggle. However, the dataset is only labelled within the same-name range, and few names contain abbreviations. Therefore, we add some real intellectual property disclosure data to this dataset to verify our method, and select some authors as experimental samples. When evaluating the classification results, we use authors whose names are prone to duplication as samples. We create the gold-standard categories manually, as follows: for each author name in Table 2, we retrieve all the papers published under that name in the database and classify the same-name authors by human annotation, as accurately as possible.

Table 2. Evaluation dataset.

4.2 Evaluation Indicators

To evaluate and compare the performance of different methods on the Name Disambiguation task, we use pairwise precision, pairwise recall and pairwise F1-measure. We define the measures as follows:

$$\begin{aligned} PairwisePrecision = \frac{\#PairsCorrectlyPredictedToSameAuthor}{\#TotalPairsPredictedToSameAuthor} \end{aligned}$$
(14)
$$\begin{aligned} PairwiseRecall = \frac{\#PairsCorrectlyPredictedToSameAuthor}{\#TotalPairsToSameAuthor} \end{aligned}$$
(15)
$$\begin{aligned} PairwiseF-Measure = \frac{2 \times PairwiseRecall \times PairwisePrecision}{PairwiseRecall + PairwisePrecision} \end{aligned}$$
(16)

In the above formulas, #PairsCorrectlyPredictedToSameAuthor is the number of paper pairs that are assigned the same label by an approach and also have the same label in the human-annotated dataset. #TotalPairsPredictedToSameAuthor is the number of paper pairs assigned the same label by an approach. #TotalPairsToSameAuthor is the number of paper pairs that have the same label in the human-annotated dataset.
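These measures can be sketched directly from the definitions in Eqs. (14)–(16):

```python
# Pairwise precision/recall/F1 over same-author paper pairs.
from itertools import combinations

def same_author_pairs(labels):
    """All unordered paper pairs sharing a label; labels: paper_id -> author."""
    return {frozenset(p) for p in combinations(labels, 2)
            if labels[p[0]] == labels[p[1]]}

def pairwise_scores(pred, gold):
    pred_pairs, gold_pairs = same_author_pairs(pred), same_author_pairs(gold)
    correct = pred_pairs & gold_pairs  # pairs correct in both labelings
    precision = len(correct) / len(pred_pairs) if pred_pairs else 0.0
    recall = len(correct) / len(gold_pairs) if gold_pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A hypothetical four-paper example: predicted vs. gold author labels.
pred = {1: "a", 2: "a", 3: "b", 4: "a"}
gold = {1: "a", 2: "a", 3: "a", 4: "b"}
print(pairwise_scores(pred, gold))
```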

4.3 Experimental Results

We first compare against a baseline using the LD algorithm. In this step, we evaluate only on the affiliation feature and do not evaluate results based on other features. Table 3 shows the results for some examples in our datasets.

Table 3. Results of affiliation string matching with ILD and LD.

It can be seen from the experimental results that the ILD algorithm improves on the LD algorithm in every evaluation measure (+17.76% over LD in average F1 score, +24.53% over LD in average precision).

According to the name similarity matching, the number of names in each name set is as follows (Table 4):

Table 4. Number of names in each name set.

In this paper, we consider several baseline methods based on Hierarchical Agglomerative Clustering (HAC) [23, 24] and single clustering (SC) [20]. SC uses only the collaborator feature for disambiguation. HAC uses the Jaccard and ILD similarity algorithms with the author name, affiliation, and collaborator features. For a fair comparison, we use the same threshold for the same attribute feature. For each feature, we compare and select the threshold that yields the highest recall while keeping precision as high as possible. Table 5 gives the threshold values of the features.

Table 5. Threshold values of features.

Table 6 gives the results for some examples in the dataset. Our method clearly outperforms the baseline methods in name disambiguation (+21.45% over SC, +9.48% over HAC in average F1 score). Our method also has higher precision than the baselines (+18.3% over SC, +8.51% over HAC in average precision).

Table 6. Results of name disambiguation.

5 Conclusion and Discussion

Name Disambiguation in the digital library is an important task because different authors can share the same name, and an author can have many name variants. This paper proposes an algorithm called Author Name Disambiguation based on Molecular Cross Clustering (ANDMC). We have also explored a string matching algorithm called Improved Levenshtein Distance (ILD). Experimental results indicate that the proposed method significantly outperforms the baseline methods, and its performance on the name coreference problem is quite satisfying. Meanwhile, we solve the problem of matching two identical strings written in different formats. In the future, we will pay more attention to the speed of the algorithm and improve its efficiency.