
1 Introduction

Data discovery and reuse play an essential role in scientific research by helping researchers find data [4, 19]. Researchers typically reuse datasets from colleagues or collaborators, and the credibility of such datasets is critical to the scientific process [11, 25]. Datasets sourced from a network of personal relationships (colleagues or collaborators) can carry limitations, as contacts tend to recommend only datasets that they themselves find helpful [2]. Yet, because research needs vary, one person’s noisy data may be another person’s valuable data. Moreover, datasets retrieved through relational networks tend to be limited to certain research areas.

As an emerging dataset discovery tool, a dataset search engine can help researchers find datasets of interest in open data repositories. Moreover, due to the increasing number of open data repositories, many dataset search engines, such as Google Dataset Search [3] and Mendeley Data, cover more than ten million datasets. While dataset search engines bring convenience to researchers, they also have certain limitations. Like general search engines, dataset search engines require the researcher to provide keywords to drive the search; they filter, rank, and return datasets based on the given keywords. To use a dataset search engine, researchers need to summarize the datasets they are looking for into these keywords, with the risk that the keywords do not cover all the desired properties and that unexpected but relevant datasets are missed. Thus, the standard pathway “scientific items \(\rightarrow \) keywords \(\rightarrow \) scientific item sets” used by existing dataset search engines has inherent limitations.

This paper proposes a recommendation method based on entity vectors trained on a citation network. This approach supports data discovery via the more direct “scientific items \(\rightarrow \) scientific items” pathway. Because our approach does not require converting scientific items (papers and datasets) into keywords, it avoids the drawbacks described above. Furthermore, we combine this new recommendation method with existing recommendation methods into an integrated ensemble recommendation method. This paper also provides a benchmark corpus for scientific item recommendation and a benchmark evaluation test. By performing benchmark tests on randomly selected scientific items from this benchmark corpus, we conclude that our integrated recommendation method using citation network entity embeddings can obtain a precision of about 70%.

Specifically, in this paper, we study three research questions:

  • Will a citation network help in scientific item discovery?

  • Can we do dataset discovery purely by link prediction on a citation network?

  • Will the addition of citation-network link prediction help for scientific item discovery?

The main contributions of this paper are: 1) we propose a method for recommending scientific items based on entity embeddings in an academic citation graph, 2) we propose a benchmark corpus and evaluation test for scientific item recommendation methods, 3) we identify an ensemble method that has high precision for scientific item recommendation, and 4) we provide the pre-trained entity embeddings for our large-scale academic citation network as an open resource for reuse by others.

2 Related Work

Data reuse aims to facilitate replication of scientific research, make scientific assets available to the public, leverage research investment, and advance research and innovation [19]. Many current works focus on supporting and bringing convenience to data reuse. Wilkinson et al. provided the FAIR guiding principles to support scientific data reuse [28]. Pierce et al. provided data reuse metrics for scientific data so that researchers can track how the scientific data is used or reused [22]. Duke and Porter provided a framework for developing ethical principles for data reuse [10]. Faniel et al. provided a model to examine the relationship between data quality and user satisfaction [12].

Dataset recommendation has also been a popular research trend in recent years. Farber and Leisinger recommended suitable datasets for a given research problem description [14]. Patra et al. provided an information retrieval (IR) paradigm for scientific dataset recommendation [20]. Altaf et al. recommended scientific datasets based on users’ research interests [1]. Chen et al. proposed a three-layered network (composed of authors, papers, and datasets) for scientific dataset recommendation [5].

3 Link Prediction with Graph Embedding on a Citation Network

The link prediction training method we use is KGloVe [7]. KGloVe collects co-occurrence statistics of nodes from random walks, using personalized PageRank. Then GloVe [21] is used to generate entity embeddings from the co-occurrence matrix. In this paper, we apply KGloVe on 638,360,451 triples of the Microsoft Academic Knowledge Graph (MAKG) [13] citation network (containing 481,674,701 nodes) to generate a co-occurrence matrix of the scientific items. Then we use the GloVe method on this co-occurrence matrix to obtain the scientific entity (item) embeddings. The trained embeddings are made available for future work. After training the entity embeddings on the MAKG citation network, we perform link prediction between scientific items (papers and/or datasets) using a similarity metric in the embedding space. We use cosine similarity, which is the most commonly used similarity for such embeddings.

Definition 1 (Link Prediction for scientific items with Entity Embedding)

Let \(E = \{e_1, e_2, ...\}\) be a set of scientific entities (also known as scientific items). Let \({{\,\mathrm{emb}\,}}\) be an embedding function for entities such that \({{\,\mathrm{emb}\,}}(e)\) is the embedding of entity \(e\in E\), where \({{\,\mathrm{emb}\,}}(e)\) is a real-valued vector of a fixed length.

Let \(\cos : E \times E \rightarrow [-1,1]\) be a function such that \(\cos (a,b) = \frac{{{\,\mathrm{emb}\,}}(a)\cdot {{\,\mathrm{emb}\,}}(b)}{||{{\,\mathrm{emb}\,}}(a)||\cdot ||{{\,\mathrm{emb}\,}}(b)||}\) where \(a,b \in E\).

Given a threshold t, we define link prediction with entity embedding in E as a function \(LP_E:E\rightarrow 2^E\) with \(LP_E(e_s) = \left\{ r \in E \mid r \ne e_s, \cos (e_s, r) \ge t \right\} \), i.e., the set of entities whose embeddings are sufficiently similar to that of the seed entity \(e_s\).
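To make Definition 1 concrete, the following is a minimal Python sketch of \(LP_E\), assuming the pre-trained entity embeddings are available as a mapping from MAG identifiers to NumPy vectors; the data layout and names are illustrative, not the format of our released embeddings.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def link_prediction(seed: str, embeddings: dict[str, np.ndarray], t: float) -> set[str]:
    """LP_E(seed): all entities whose cosine similarity to the seed is at least t."""
    seed_vec = embeddings[seed]
    return {e for e, vec in embeddings.items()
            if e != seed and cosine(seed_vec, vec) >= t}
```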

4 Dataset Recommendation Methods

In this section we use the previous definition of link prediction to introduce two new dataset recommendation methods, as well as three methods from our previous work. We also propose an open-access scientific item recommendation evaluation benchmark, including a corpus and an evaluation pipeline (Fig. 1).

Fig. 1. Pipeline of the scientific item recommendation and evaluation benchmark.

4.1 Dataset Recommendation Methods

The dataset recommendation methods in this section use a combination of link-prediction and ranking approaches to recommend scientific items based on a given scientific item.

Data Recommendation with Link Prediction Using a Citation Network. This scientific entity (item) recommendation method is based on Definition 1, where a set of entities is returned such that the cosine similarity between these entities and the given entity is at least a threshold t. Based on the list of scientific items returned by the link prediction algorithm, the recommendation method considers only the top-n results of that list, with the value of n chosen as a parameter of the method. Formally, this is defined as follows:

Definition 2

(Top-n Scientific Entity (Item) Recommendation with Link Prediction). Let \(E = \{e_1, e_2, ...\}\) be a set of scientific entities (also known as scientific items). Let \(LP_E\) be a link prediction function using embeddings in E (see Definition 1). Top-n scientific entity recommendation with link prediction using embeddings is a function \(DRLP_{E}^n\) that maps an entity \(e_s\) to \((r_1, \dots , r_m)\), the longest ordered list of \(m \le n\) pairwise distinct elements of \(LP_E(e_s)\) such that \(\forall i = 1 \dots m - 1, \cos (e_s, r_i) \ge \cos (e_s, r_{i+1})\).

In words, this function maps an entity (scientific item) to a list of at most n other entities (scientific items) that are closest to it in the embedding space, ordered by decreasing similarity.
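Building on the sketch given for Definition 1, a minimal illustration of the top-n recommendation function \(DRLP_E^n\) could look as follows (again, names are illustrative):

```python
def recommend_top_n(seed: str, embeddings: dict, t: float, n: int) -> list[str]:
    """DRLP_E^n(seed): at most n candidates from LP_E(seed),
    ordered by decreasing cosine similarity to the seed."""
    candidates = link_prediction(seed, embeddings, t)
    ranked = sorted(candidates,
                    key=lambda e: cosine(embeddings[seed], embeddings[e]),
                    reverse=True)
    return ranked[:n]
```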

We can now combine this general definition with a specific embedding function \({{\,\mathrm{emb}\,}}\) to create a specific link-prediction-based recommendation method. In particular, we use KGloVe embeddings from the MAKG citation network to create a recommendation method based on link prediction from a citation network.

Scientific Items Recommendation with BERT-based Link Prediction. The method from the previous subsection used the embeddings computed on the citation graph to determine similarity between data items. This is a plausible choice, since we can expect the MAKG citation graph to give a reasonable signal for similarity in the scientific domain: it captures the scientific relationships between items. In contrast, we also experimented with other models to compute the similarity between items. In particular, we used the pretrained BERT model [9, 23] as an example of a cross-domain model, to see whether such a generic pretrained model would also suffice to compute the similarity metric that underlies our link-prediction-based recommendation algorithm. The pretrained BERT model used in this paper is the all-mpnet-base-v2 model from the SentenceTransformers Python library. BERT-based link prediction for scientific items is obtained by applying the pretrained BERT model to the descriptive metadata of the scientific items, which consists of the title of the dataset and a short accompanying text, to obtain a BERT embedding for each item. We then use these BERT embeddings in Definition 2 to make scientific item recommendations.
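As an illustration, item embeddings can be obtained from the title and short description with the SentenceTransformers library roughly as follows; the metadata shown is invented for the example and is not taken from the benchmark corpus.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-mpnet-base-v2")

# Hypothetical metadata: MAG identifier -> "title. short description"
metadata = {
    "mag:123": "Arctic sea ice extent 1979-2020. Monthly satellite-derived ice extent.",
    "mag:456": "Global temperature anomalies. Gridded monthly temperature anomaly data.",
}

ids = list(metadata)
vectors = model.encode([metadata[i] for i in ids])  # one vector per item
bert_embeddings = {i: np.asarray(v) for i, v in zip(ids, vectors)}

# These embeddings can now be plugged into the same top-n
# recommendation function sketched for Definition 2.
```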

Scientific Items Recommendation with BM25-based Data Ranking. BM25-based data ranking is the recommendation approach provided in our previous paper [27]. Given a seed scientific item, we rank the list of candidate scientific items with the popular BM25 method from information retrieval [24], using the descriptive metadata of the items (title and textual description); a higher ranking position means a better recommendation.
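A minimal sketch of such BM25-based ranking over item metadata, here using the rank_bm25 package (the tokenisation and example data are simplifications, not our released implementation):

```python
from rank_bm25 import BM25Okapi

# Candidate items described by their title + short description (illustrative data).
candidates = {
    "mag:123": "arctic sea ice extent monthly satellite observations",
    "mag:456": "global temperature anomalies gridded monthly data",
}
ids = list(candidates)
bm25 = BM25Okapi([candidates[i].split() for i in ids])

def rank_candidates(seed_text: str) -> list[str]:
    """Rank candidate items by BM25 score against the seed item's metadata."""
    scores = bm25.get_scores(seed_text.split())
    return [i for i, _ in sorted(zip(ids, scores), key=lambda p: p[1], reverse=True)]
```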

Scientific Items Recommendation with Graph Walk. The co-author network-based graph walk method is a scientific item recommendation method that we previously proposed in [26]. Such a graph walk on a co-author network performs the recommendation task according to the “scientific items \(\rightarrow \) author \(\rightarrow \) co-author network \(\rightarrow \) author \(\rightarrow \) scientific items” pathway. In order to reduce the number of candidate recommendations, we only consider items connected to authors within an n-hop distance of the author of the seed item in the co-author network.
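The n-hop restriction can be sketched as a breadth-first search over the co-author network, assuming the network is given as an adjacency mapping from authors to co-authors (a simplification of the actual MAKG data):

```python
from collections import deque

def authors_within_n_hops(start: str, coauthors: dict[str, set[str]], n: int) -> set[str]:
    """Breadth-first search: all authors reachable from `start` in at most n hops."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        author, depth = frontier.popleft()
        if depth == n:
            continue
        for neighbour in coauthors.get(author, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen

# Candidate items are then restricted to those written by authors in this set.
```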

Dataset Recommendation with Pre-trained Author Embedding. Finally, in earlier work [26] we proposed a recommendation method for scientific items based on pre-trained co-authorship embeddings. This approach is analogous to our proposed method using embeddings from the MAKG citation network (Definition 1), but uses embeddings computed from the MAKG co-author network instead.

5 Scientific Items Recommendation Benchmark

To evaluate the performance of scientific item recommendation methods, we propose here an open-source, generalized benchmark corpus and evaluation process for scientific item recommendation. Scientific items in general can be publications, datasets, graphs, tables, geographic data, etc.

Table 1. Statistics of benchmark corpus

5.1 Benchmark Corpus

The benchmark corpus is an HDT/RDF graph [15, 18] stored as triples of the form “[scientific item] [link] [scientific item].” The scientific items are the intersection of the scientific items in ScholeXplorer and MAKG (Microsoft Academic Knowledge Graph). This intersection is computed by matching the DOIs of scientific items (datasets and/or papers) between ScholeXplorer and MAKG. We represent all scientific items by the identifier used in the Microsoft Academic Graph (MAG). With the help of these MAG identifiers, information about the scientific items (such as title, providers, publishers, or creators) is easily accessible in MAKG. The bi-directional links between these items come from ScholeXplorer, and all links are provided by data sources managed by publishers, data centers, or other organizations.
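Conceptually, the intersection is a join on (normalised) DOIs; the sketch below assumes both sources have already been reduced to identifier-to-DOI mappings, which abstracts away the actual HDT/RDF processing.

```python
def intersect_by_doi(makg_items: dict[str, str], scholix_items: dict[str, str]) -> dict[str, str]:
    """Map ScholeXplorer identifiers to MAG identifiers via shared (normalised) DOIs.

    makg_items:    MAG identifier     -> DOI
    scholix_items: Scholix identifier -> DOI
    """
    doi_to_mag = {doi.lower(): mag_id for mag_id, doi in makg_items.items()}
    return {sx_id: doi_to_mag[doi.lower()]
            for sx_id, doi in scholix_items.items()
            if doi.lower() in doi_to_mag}
```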

In Table 1, we show the statistics of our benchmark corpus. There are more than 3 million items and more than 15 million bi-directional links between them. We provide a data subset with only the bi-directional links between scientific papers, consisting of 2.9 million scientific papers and 14.3 million links between them. We also provide a data subset with only the bi-directional links between scientific datasets, consisting of 1,544 scientific items and 2,335 links between them. We have made this corpus available at https://zenodo.org/record/6386897.

5.2 Benchmark Evaluation

The goal of our benchmark is to evaluate the performance of scientific item recommendation methods on all datasets in the benchmark corpus, with the option to use only a randomly selected subset. We use the F1-measure [6] to evaluate how well recommendation methods reconstruct the bi-directional links between scientific items. It consists of three evaluation metrics: recall, precision, and F1-score. Recall is the percentage of true links (i.e., links in the corpus that start from the seed item) that the recommendation method recovers. Precision is the percentage of scientific items recommended by the method that are correct (i.e., present in the gold standard). Finally, the F1-score is the harmonic mean of recall and precision.
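For a single seed item, these metrics can be computed as in the following sketch; the benchmark then averages the values over all evaluated seed items.

```python
def precision_recall_f1(recommended: set[str], gold_links: set[str]) -> tuple[float, float, float]:
    """Compare recommended items against the gold-standard links of one seed item."""
    hits = len(recommended & gold_links)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(gold_links) if gold_links else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```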

6 Experiments and Results

This section presents the setup and results of our experiments on the recommendation methods proposed in Sect. 4, using the evaluation benchmark from Sect. 5. The implementation of the recommendation methods and the code of all experiments can be found at https://github.com/XuWangVU/datarecommend.

6.1 Experimental Setup

We set up three evaluation experiments using three sets of data randomly selected from the benchmark corpus. The statistics of the selected data are shown in Table 2. For each seed scientific item, we look for recommendations among all the candidate scientific items and return a sorted subset of these candidates.

Table 2. Statistics of experiments

The recommendation methods evaluated in the experiments comprise the five methods described in Sect. 4. Beyond these single methods, we also tested ensemble methods that combine multiple methods to make recommendations. All methods (including the ensemble methods) fall into two pathway-based categories: pathways with authors and pathways without authors. All methods (including the ensemble methods) used in our experiments are listed in Table 3.

Table 3. Scientific items recommendation methods used for experiments.

We use thresholds for two methods: a distance threshold for graph walks and a similarity threshold between author embeddings. The distance threshold for graph walks is the maximum number of hops in a graph walk; for example, hop1 means that only authors at distance 1 from the given author are considered. The author embedding similarity threshold means that only authors whose embedding similarity with the given author is greater than or equal to the threshold are considered.

Each recommendation method is assigned a parameter. For the graph walk method, we use hop1, hop2, or hop3 to represent the distance threshold of the graph walk. For the similarity method between pretrained MAKG author embeddings, we use similarity thresholds ranging from 0.3 to 0.7, in steps of 0.1. For the BM25-based ranking method, we use the parameter \(p_{bm25} = 2 \cdot outdegree(seed)\), where outdegree(seed) is the number of scientific items linked from the seed in the benchmark corpus; in other words, we only consider the top \(p_{bm25}\) results in the list returned by the ranking method. For both the link prediction method using citation network embeddings and the BERT-based link prediction method, we use a parameter of 0.8, meaning that we only consider the top 80% of the sorted lists returned by these methods.
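To illustrate how these parameters act as cutoffs on the ranked candidate lists, a small sketch (the function names are ours and not part of the released code):

```python
def bm25_cutoff(ranked: list[str], outdegree_seed: int) -> list[str]:
    """BM25 ranking: keep only the top p_bm25 = 2 * outdegree(seed) items."""
    return ranked[: 2 * outdegree_seed]

def top_fraction(ranked: list[str], fraction: float = 0.8) -> list[str]:
    """Embedding-based link prediction: keep the top 80% of the sorted list."""
    return ranked[: int(len(ranked) * fraction)]
```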

6.2 Experimental Results

Fig. 2. Precision comparison in Experiment 1, Experiment 2, and Experiment 3.

Table 4 shows the results of the scientific item recommendation methods that do not consider authors in the pathway, while Tables 5, 6 and 7 show the results of methods that do consider authors. We use color-coding of the cells to indicate different value ranges: red indicates relatively poor performance compared with related settings, green indicates outstanding performance, and yellow indicates average performance.

Table 4. Results of Experiment(EXP) 1, 2 & 3 without graph walk and author embedding.

In the experiments that do not consider authors, we found that recall, precision, and F1-score were usually not high, except for the method that only uses BERT, where we obtained a recall of over 0.95. However, even this method does not achieve a sufficiently high precision.

When the author network is taken into consideration, the precision improves considerably, and some integrated methods achieve precision results of 0.7 or even 0.8. Unfortunately, these high precision rates come with a decreased recall, which means that the methods return few, but often correct, recommendations.

This behavior, i.e., high precision at relatively low recall, is typical of and sufficient for recommendation engines. Hence, we explore these results in more detail. A comparison of the precision of the different methods can be found in Fig. 2. For Experiment 1, we observe little variability, likely due to the small data size. For Experiments 2 and 3, however, the precision increases with a higher distance threshold for the graph walk or a higher threshold for the author embedding similarity.

Based on the comparison of the results of the different methods in Tables 5, 6 and 7 and Fig. 2, we conclude that all recommendation methods that use data ranking (BM25) or link prediction (citation embedding) achieve high precision on our scientific item recommendation benchmark when combined with the graph walk and author embedding similarity methods in an ensemble.

7 Conclusion and Discussion

In this paper, we have investigated the use of a large-scale citation network for recommending scientific items, given a scientific item provided by the user, according to the well-known paradigm “if you like this dataset, you might also like these other datasets”. The method uses low-dimensional vector-space embeddings computed from the citation graph to compute the cosine similarity between datasets as the basis for its recommendations. By itself, this method performed unsatisfactorily on our benchmark under a variety of experimental settings.

We therefore also studied the behaviour of this method in an ensemble with a number of other methods: recommendations based on n-hop walks in a co-author graph (\(n=1,2,3\)), recommendations based on embeddings computed over this co-author graph, recommendations based on the BERT language model, and the BM25 method from information retrieval. We studied a large variety of the most promising combinations of methods under different experimental settings. In our largest experimental setting, the ensemble methods that used the embeddings from the citation network outperformed those that did not, with a precision of 0.64 under a variety of settings. This acceptable precision in a recommendation setting comes at the price of a low recall, a behaviour that is typical of recommendation engines.

This allows us to succinctly answer the research questions we formulated in the introduction of this paper:

  • Will a citation network help in dataset discovery? Answer: yes

  • Can we do dataset discovery purely by link prediction on a citation network? Answer: no

  • Will the addition of citation-network link prediction help for dataset discovery? Answer: yes

We performed our experiments on a newly constructed benchmark set, using the KGloVe method to train scientific entity (item) embeddings from the Microsoft Academic Knowledge Graph, containing a citation network of 100 million edges. We have made this benchmark corpus available online.

The methods that we designed and evaluated in this paper are clearly not the final word on how to recommend scientific items. Likely, the results can be improved not only by tuning parameters to specific datasets, but also by adding other applicable existing methods. The dataset could also be expanded. We have used both citation and co-author networks as signals for academic similarity, but other academic networks exist as well. Including those is a subject of future work.

The link prediction used in this paper relies on pre-trained embedding models. One drawback of this type of model is that it requires an embedding for each entity in the graph, and hence many existing models do not scale well enough. In the future, several approaches could be investigated to overcome this limitation. One option is to use a model that can work in an inductive setting, based on the description or even the content of the datasets; an example of such a method is BLP [8]. To reduce the number of embeddings, we could also use a model that only keeps embeddings for some entities in the graph, like NodePiece [16]. Another direction could be to scale models using summarization, as was done in [17].