1 Introduction

In order to be considered Linked Data, the datasets published on the web have to be connected, or linked, to other datasets [1]. RDF links such as owl:sameAs between datasets are fundamental for Linked Data, as they connect data islands into a global data space, the so-called Web of Data. Data linking [2] can be formalized as an operation that takes two Linked Data datasets as input and produces a collection of links between entities of the two datasets as output. When a new dataset is published as Linked Data, the publisher would have to check all the datasets in the Web of Data to identify possible links, which is very time-consuming. If, instead, candidate datasets could be recommended based on known links, so that linking effort is focused on the datasets most likely to yield links, the computational cost could be reduced sharply, provided the recommendations are accurate enough.

In the Web of Data, an increasing number of owl:sameAs (see footnote 1) statements have been published to support merging distributed descriptions of equivalent RDF resources from different datasets. The owl:sameAs property is part of the Web Ontology Language (OWL) ontology [3]; its official semantics is that an owl:sameAs statement indicates that two URI references actually refer to the same thing. Taken together, these owl:sameAs statements form a very large directed graph connecting Linked Data datasets to each other. Due to the outstanding role of owl:sameAs as the most widely used linking predicate [4], we focus on the recommendation of datasets for sameAs interlinking. Previous works [5–8] mostly did not distinguish RDF link types when identifying datasets for interlinking, and their experiments were conducted on data constructed from RDF links of various types, even though the graphs formed from different types of RDF links exhibit different characteristics [4]. Such works are of limited help in real application scenarios, as dataset publishers still do not know which kinds of RDF links can be established, let alone how to configure the data linking algorithms. These limitations call for better methods.

In this paper we tackle the problem of identifying further datasets with which owl:sameAs links can be established, given that the publisher's dataset is already linked to a few datasets. This is a scenario in which Recommender systems [11] techniques can be applied. We construct a user-item matrix whose rating values depend on the number of owl:sameAs RDF link triples between datasets, extracted from the newly updated LOD Cloud 2014 dump [4]. Several classical collaborative filtering methods from Recommender systems are applied. Utilizing the semantic schema information extracted from Linked Data datasets, we define a dataset semantic similarity that replaces the original similarity component of memory-based collaborative filtering methods, yielding our customized recommenders. To evaluate the recommenders, we conduct two experiments assessing rating accuracy and top n recommendation accuracy. Experimental results demonstrate that our customized recommenders perform much better than the original ones. The MAEs are only half those of the original recommenders, falling within the range (0.3, 0.5) on a rating scale of 1 to 7. The F-Measures are almost twice as high, falling within the range (0.2, 0.5), which is promising given the large set of datasets to recommend from. This drastic improvement stems from combining dataset semantic similarity with memory-based collaborative filtering recommenders. The source code and experimental data have been uploaded to GitHub (see footnote 2).

The rest of the paper is organized as follows. In Sect. 2, we first describe the framework that casts the dataset identification problem as a Recommender systems problem and explain how we construct the user-item rating matrix. We then describe the collaborative filtering techniques we apply to the problem, and finally define a dataset semantic similarity algorithm used for injecting domain-specific information. In Sect. 3, we describe the experimental data, evaluation methodology, and results. In Sect. 4, we present related work. Finally, in Sect. 5 we conclude the paper.

2 Recommender Systems Techniques

We model the problem of identifying target datasets for sameAs interlinking as a Recommender systems problem, and describe how to construct the user-item rating matrix required by recommendation algorithms in Sect. 2.1. The representative collaborative filtering algorithms we employ are briefly described in Sect. 2.2. We also define a dataset semantic similarity algorithm that serves as the similarity computation component of memory-based recommenders in Sect. 2.3.

2.1 Recommendation Framework

Recommender systems are personalized information agents that attempt to predict which items out of a large pool a user may be interested in. The user's interest in an item is expressed through the rating the user gives the item. Generally, the interaction between user and item is represented with a user-item rating matrix. A recommender system has to predict the ratings for items that the user has not yet seen. With these estimated ratings the system can recommend the items with the highest estimated ratings to the target user. Note that item is a general term denoting what the system recommends to users, and can be of any type, such as movies, books, websites, or news articles. In our case, the items are Linked Data datasets available in the Web of Data. We use \(U=\{u_1,u_2,u_3,...,u_n\}\) to denote the set of dataset publishers (users) and \(D = \{d_1,d_2,d_3,...,d_m\}\) for the set of datasets (items). We assume each dataset \(d_i\) is published by a unique publisher \(u_i\), which makes \(n = m\). This may not hold in the real world, but \(u_i\) is merely an identifier for dataset \(d_i\) in the publisher set U, which makes the representation easy to understand in a Recommender systems scenario. We denote by R an \(n \times n\) matrix of ratings \(r_{i,j}\), with \(i \in \{1,...,n\}, j \in \{1,...,n\}\). Recommender algorithms are used to predict the rating values of a given dataset publisher for the datasets he or she has not linked to, or to recommend a ranked list of datasets he or she might want to link to, according to the predicted rating values.

We aggregate all owl:sameAs RDF links by dataset, meaning that we consider the publisher (user) of dataset a to have a rating for dataset b if there exists at least one owl:sameAs RDF link triple whose subject is contained in dataset a and whose object is contained in dataset b. We find that some Linked Data dataset publishers did not choose the standard http://www.w3.org/2002/07/owl#sameAs as the linking predicate, but used terms from proprietary vocabularies, such as http://www.abes.fr/owlsameAs, or even the mistaken http://www.w3.org/2002/07/owlsameAs and http://www.w3.org/2000/01/rdf-schema#sameAs; we also extract links defined by these predicates. Since a dataset can be viewed as sameAs-interlinked with itself, in that case the number of RDF link triples equals the number of entities defined in the dataset. Rating values are set based on the number of owl:sameAs RDF link triples: the rating value equals the number of digits of the link triple count. We illustrate the construction of the rating matrix from dataset interlinking with the example in Fig. 1. For instance, dataset \(d_1\) has 243 RDF links to dataset \(d_2\), so the corresponding matrix entry \(r_{1,2}\) equals 3.

Fig. 1. An example illustrating how to construct the rating matrix from a dataset interlinking network. (a) An interlinking network of five datasets, identified by \(d_1, d_2, ..., d_5\); the number inside each circle is the number of entities in that dataset, and the arrows represent owl:sameAs RDF links between datasets, labeled with the RDF link triple count. (b) The user-item bipartite graph constructed from (a); each user also links to its own dataset in the item set. (c) The generated \(5\times 5\) rating matrix.
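As a minimal illustration of this rating rule, the following sketch (a hypothetical helper, not part of the released implementation) converts a link triple count into a rating value by counting its decimal digits:

```java
// Hypothetical sketch of the rating rule described above:
// the rating equals the number of decimal digits of the link triple count.
public class RatingRule {

    /** Returns the rating for a given owl:sameAs link triple count. */
    static int rating(long linkTripleCount) {
        if (linkTripleCount <= 0) {
            return 0; // no links: the matrix entry stays empty
        }
        return (int) Math.floor(Math.log10(linkTripleCount)) + 1;
    }

    public static void main(String[] args) {
        System.out.println(rating(243));     // 3, matching r_{1,2} in Fig. 1
        System.out.println(rating(7));       // 1
        System.out.println(rating(1234567)); // 7, the top of the 1-to-7 scale
    }
}
```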

2.2 Collaborative Filtering Recommendation

Collaborative filtering is the most widely implemented and most mature recommendation technique. The idea is to exploit correlations between users or between items. Techniques fall into memory-based and model-based categories [11].

Memory-Based Recommendation. Memory-based recommenders can be divided into user-based and item-based recommenders. The main idea of user-based algorithms is as follows: given a ratings matrix and the ID of the current (active) user as input, identify the nearest users, i.e., those whose past preferences were most similar to the active user's. Then, for every item i that the active user has not yet seen, a prediction is computed from the ratings for i made by those nearest users:

$$\begin{aligned} r_{a,i} = \sum _{b\in N_{a}}sim(a,b)\times r_{b,i} \end{aligned}$$
(1)

The similarity measure between users a and b, sim(a,b), is essentially a distance measure and is used as a weight. Different measures can be used to compute the similarity, such as cosine similarity, Pearson correlation, Spearman correlation, or Euclidean distance.
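As an illustration, here is a minimal sketch of cosine similarity between two users, computed over co-rated items (illustrative code, not the paper's implementation; note that whether the norms are taken over co-rated items only, as here, or over each user's full rating vector varies between implementations):

```java
import java.util.Map;

// Illustrative sketch: cosine similarity between two users,
// computed over the items both users have rated.
public class UserCosineSimilarity {

    static double sim(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) {          // only items co-rated by both users
                dot += e.getValue() * rb;
                normA += e.getValue() * e.getValue();
                normB += rb * rb;
            }
        }
        if (normA == 0 || normB == 0) return 0; // no co-rated items
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<Integer, Double> u1 = Map.of(1, 3.0, 2, 1.0, 3, 2.0);
        Map<Integer, Double> u2 = Map.of(2, 1.0, 3, 2.0, 4, 5.0);
        System.out.println(sim(u1, u2)); // 1.0: identical on co-rated items 2 and 3
    }
}
```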

The main idea of item-based algorithms is to compute predictions using the similarity between items rather than the similarity between users. An item-based algorithm computes a weighted average of the active user's other ratings as follows, where sim(i,j) is computed analogously to the user-based case:

$$\begin{aligned} r_{a,i} = \sum _{j\in S_{i}}sim(i,j)\times r_{a,j} \end{aligned}$$
(2)
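The following minimal sketch implements the weighted sum of Eq. (2) for a single target item (illustrative code under the equation as given; many implementations additionally normalize by the sum of the similarity weights):

```java
import java.util.Map;

// Illustrative sketch of the item-based prediction in Eq. (2): a weighted
// sum of the active user's ratings for items similar to the target item.
public class ItemBasedPrediction {

    static double predict(Map<Integer, Double> userRatings,
                          Map<Integer, Double> simToTarget) {
        double score = 0;
        for (Map.Entry<Integer, Double> e : userRatings.entrySet()) {
            Double sim = simToTarget.get(e.getKey()); // sim(i, j) for rated item j
            if (sim != null) {
                score += sim * e.getValue();          // sim(i, j) * r_{a,j}
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<Integer, Double> ratings = Map.of(1, 3.0, 2, 2.0); // r_{a,j}
        Map<Integer, Double> sims = Map.of(1, 0.8, 2, 0.5);    // sim(i, j)
        System.out.println(predict(ratings, sims)); // 0.8*3 + 0.5*2 = 3.4
    }
}
```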

Matrix Factorization Models. Matrix factorization [12] is a model-based method that maps both users and items to a joint latent factor space of dimensionality f. Accordingly, each item i is associated with a vector \(q_i \in \mathbb {R}^{f}\), and each user u is associated with a vector \(p_u \in \mathbb {R}^{f}\). For a given item i, the elements of \(q_i\) measure the extent to which the item possesses the corresponding factors; the same holds for a user. The resulting dot product, \({q_i}^{T}p_u\), approximates the rating of user u for item i. To learn the factor vectors (\(p_u\) and \(q_i\)), the system minimizes the regularized squared error on the set of known ratings:

$$\begin{aligned} \min _{p^{*},q^{*}}\sum _{(u,i)\in K}{(r_{u,i}-{q_{i}}^{T}p_{u})}^{2}+\lambda ({\Vert p_{u}\Vert }^{2}+{\Vert q_{i}\Vert }^{2}) \end{aligned}$$
(3)

Here, K is the set of (u,i) pairs for which \(r_{u,i}\) is known (the training set). Simon Funk popularized a stochastic gradient descent optimization of Eq. 3 in which the algorithm loops through all ratings in the training set. For each rating, it modifies the parameters in the direction opposite to the gradient, by a magnitude proportional to the learning rate \(\gamma \), yielding:

$$\begin{aligned} {e}_{u,i}={r}_{u,i}-{{q}_{i}}^{T}{p}_{u} \end{aligned}$$
(4)
$$\begin{aligned} {q}_{i}\leftarrow {q}_{i}+\gamma \cdot ({e}_{u,i}\cdot {p}_{u}-\lambda \cdot {q}_{i}) \end{aligned}$$
(5)
$$\begin{aligned} {p}_{u}\leftarrow {p}_{u}+\gamma \cdot ({e}_{u,i}\cdot {q}_{i}-\lambda \cdot {p}_{u}) \end{aligned}$$
(6)

2.3 Injecting Domain-Specific Information

Both user-based and item-based recommenders rely on a similarity component. In a user-based recommender the similarity between two users is based on their ratings of items that both users have rated; likewise, in an item-based recommender the similarity between two items is based on the ratings of users who rated both items. As a Linked Data dataset is a collection of RDF triples describing entities with RDFS vocabularies or OWL ontologies, this motivates us to extract its semantic schema features to define a dataset semantic similarity, and to use it as the similarity component of memory-based recommenders to build our customized recommenders. In this section, we describe how we model a Linked Data dataset with the vector space model (VSM) using semantic features, and then how to calculate the similarity between datasets based on this model.

Vector Space Model for Linked Data Datasets. The vector space model is an algebraic model for representing text documents (and objects in general) as vectors of identifiers such as index terms. In the VSM each document is represented by a vector in an m-dimensional space, where each dimension corresponds to a term from the overall vocabulary of a given document collection. We adapt the VSM to model datasets. A Linked Data dataset uses one or more RDFS vocabularies or OWL ontologies; the vocabulary provides the terms (classes and properties) for expressing the data. A vocabulary URI, class URI, or property URI used in the triples of a dataset can be seen as a semantic feature of the dataset; these are called vocabulary features, class features, and property features respectively. Let \(T = \{t_1,t_2,...,t_m\}\) be the dictionary, that is, the set of semantic features of the datasets in the corpus. Formally, each dataset \(d_i = \{w_{1i},w_{2i},...,w_{mi}\}\), where \(w_{ki}\) is the weight of feature \(t_k\) in dataset \(d_i\).

Dataset representation in the VSM raises two issues: weighting the terms and measuring the similarity between feature vectors. The most commonly used term weighting scheme is TF-IDF (Term Frequency-Inverse Document Frequency):

$$\begin{aligned} TF\text {-}IDF(t_k,d_j)=\frac{f(t_k,d_j)}{|\{t_i\in d_j\}|}\cdot \log \frac{n}{|\{d_j\in D:t_k\in d_j\}|} \end{aligned}$$
(7)

\(f(t_k,d_j)\) is the number of times that feature \(t_k\) occurs in dataset \(d_j\).
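The following sketch computes the TF-IDF weights of Eq. (7) for datasets represented as multisets of semantic features (the feature names are made up for illustration, and \(|\{t_i \in d_j\}|\) is taken here as the total number of feature occurrences in \(d_j\)):

```java
import java.util.*;

// Illustrative sketch of the TF-IDF weighting in Eq. (7), applied to
// datasets represented as multisets of semantic features (vocabulary,
// class, and property URIs). Uses the natural logarithm.
public class TfIdf {

    static Map<String, Double> weights(List<String> dataset, List<List<String>> corpus) {
        Map<String, Double> w = new HashMap<>();
        int total = dataset.size(); // |{t_i in d_j}|: total feature occurrences
        for (String t : new HashSet<>(dataset)) {
            long tf = dataset.stream().filter(t::equals).count();          // f(t_k, d_j)
            long df = corpus.stream().filter(d -> d.contains(t)).count();  // datasets containing t_k
            w.put(t, (tf / (double) total) * Math.log(corpus.size() / (double) df));
        }
        return w;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("foaf:Person", "foaf:name", "foaf:name"),
            List.of("foaf:Person", "dbo:birthPlace"),
            List.of("skos:Concept", "skos:prefLabel"));
        // foaf:Person occurs in two datasets, so it gets a lower IDF than foaf:name
        System.out.println(weights(corpus.get(0), corpus));
    }
}
```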

Dataset Semantic Similarity. As stated earlier, a similarity measure is required to determine the closeness of two datasets. Many measures have been derived to describe the proximity of two vectors; among them, cosine similarity is the most widely used:

$$\begin{aligned} sim(d_i,d_j)=\frac{\sum _{k}^{}w_{ki}\cdot w_{kj}}{\sqrt{\sum _{k}^{}{w_{ki}}^{2}}\cdot \sqrt{\sum _{k}^{}{w_{kj}}^{2}}} \end{aligned}$$
(8)

As we assume each dataset is published by a unique publisher, the dataset semantic similarity can be used as the user similarity component in the user-based recommender as well as the item similarity component in the item-based recommender. Using dataset semantic similarity in memory-based methods also helps to relieve the cold start problem of recommender systems, which is common in our scenario, as the number and variety of Linked Data datasets are increasing rapidly. A “new” dataset with few interlinkings to other datasets cannot easily be recommended by pure memory-based recommenders, because their similarity is computed from past ratings. Since dataset semantic similarity is computed using only information from the datasets themselves, recommendations can be made even for new datasets.

3 Experiments

3.1 Experimental Data

We construct the experimental data from the LOD Cloud 2014 dataset published in [4]. The data is a crawl of the Web of Linked Data conducted in April 2014, containing 8,038,396 resources crawled from 900,129 documents. The crawled data is provided for download as a single N-Quads formatted 2.6 GB zipped dump file. Using the dataset URIs published by the authors, we divided the dump into 990 separate dataset dumps, in which the quads whose subject URIs are defined under the same dataset URI are grouped together. For each dataset dump, we extract semantic features of the property, class, and vocabulary types. The property features of a dataset are obtained by grouping all the predicate URIs of RDF triples in the dataset dump. The class features are obtained by grouping the object URIs of RDF triples matching the (s rdf:type o) pattern. Since the namespace of a class or property URI is the URI of the vocabulary in which the class or property is defined, grouping the namespaces of all class and property URIs of a dataset yields its vocabulary features. Using the method described in Sect. 2.1, we construct our experimental user-item matrix with 990 users, 990 items, and 1641 ratings from users to items.
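The grouping logic can be sketched as follows (a simplified, hypothetical version using regular expressions; a real extraction should rely on a proper RDF parser such as Apache Jena, and this sketch skips literals and blank nodes):

```java
import java.util.*;
import java.util.regex.*;

// Simplified, hypothetical sketch of feature extraction from N-Quads lines.
// The regex only matches quads whose terms are all URIs.
public class FeatureExtractor {

    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
    static final Pattern QUAD = Pattern.compile("<([^>]+)>\\s+<([^>]+)>\\s+<([^>]+)>.*");

    public static void main(String[] args) {
        List<String> lines = List.of(
            "<http://ex.org/a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> <http://ex.org/g> .",
            "<http://ex.org/a> <http://xmlns.com/foaf/0.1/name> <http://ex.org/n> <http://ex.org/g> .");

        Set<String> properties = new HashSet<>(), classes = new HashSet<>(), vocabularies = new HashSet<>();
        for (String line : lines) {
            Matcher m = QUAD.matcher(line);
            if (!m.matches()) continue; // skip literals/blank nodes in this sketch
            String predicate = m.group(2), object = m.group(3);
            properties.add(predicate);                       // property feature
            if (RDF_TYPE.equals(predicate)) classes.add(object); // class feature
        }
        // The vocabulary feature is the namespace part of each class/property URI.
        for (String uri : properties) vocabularies.add(uri.replaceAll("[^/#]+$", ""));
        for (String uri : classes) vocabularies.add(uri.replaceAll("[^/#]+$", ""));
        System.out.println(properties + "\n" + classes + "\n" + vocabularies);
    }
}
```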

3.2 Evaluation Methodology

Evaluating Rating Accuracy. Mean Absolute Error (MAE) is used to measure the closeness of predicted ratings to the true ratings. It is defined as the average absolute difference between the n pairs \(\langle p_h,r_h\rangle \) of predicted and real ratings:

$$\begin{aligned} MAE=\frac{\sum _{h=1}^{n}|{p}_{h}-{r}_{h}|}{n} \end{aligned}$$
(9)

In our experiments, for each user, we take a certain percentage of the ratings as “training data” to produce recommendations, and the rest of the ratings are compared against the estimated rating values to compute the MAE. The results may differ as the data set is split randomly; hence, for each algorithm, we run the test 10 times and report the average score.
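A minimal helper realizing Eq. (9) is sketched below (the random splitting and the averaging over 10 runs are orchestrated outside this function):

```java
// Minimal sketch of the MAE computation in Eq. (9) over paired
// predicted and real ratings.
public class Mae {

    static double mae(double[] predicted, double[] real) {
        double sum = 0;
        for (int h = 0; h < predicted.length; h++) {
            sum += Math.abs(predicted[h] - real[h]); // |p_h - r_h|
        }
        return sum / predicted.length;
    }

    public static void main(String[] args) {
        System.out.println(mae(new double[] {2.5, 3.0, 1.0},
                               new double[] {3.0, 3.0, 2.0})); // (0.5 + 0 + 1)/3 = 0.5
    }
}
```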

Evaluating Top N Recommendations. It is not always essential to present estimated rating values to users; in many cases, an ordered list of recommendations, from best to worst, is sufficient. We therefore apply the classic information retrieval metric F1-Measure to evaluate the recommenders. We adopt a leave-one-out strategy: for each user, we remove his or her top n ratings and use the remaining ratings, together with all other users' ratings, as the training set. The final scores are calculated by averaging all users' test results. As the test rating records are selected in descending order of their values rather than randomly, we do not need to repeat the experiments for memory-based algorithms. For the matrix factorization algorithm, whose computation starts from randomly initialized vectors, we run the test 10 times and present the average results.

3.3 Results and Discussion

The recommendation algorithms and evaluation methods are implemented with the help of Mahout [13], a Java machine learning library. To inject domain-specific information as described in Sect. 2.3, we implement customized classes realizing the ItemSimilarity and UserSimilarity interfaces of Mahout.
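A hedged sketch of such a customized similarity component is shown below. It is written against Mahout's 0.x "taste" API, whose exact signatures may differ between versions, and it assumes the dataset semantic similarities have been precomputed with Eq. (8):

```java
import java.util.Collection;
import java.util.Map;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.similarity.PreferenceInferrer;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Sketch: a UserSimilarity backed by precomputed dataset semantic
// similarities, replacing the rating-based similarity (Sect. 2.3).
public class DatasetSemanticSimilarity implements UserSimilarity {

    private final Map<Long, Map<Long, Double>> semanticSims; // datasetID -> datasetID -> cosine

    public DatasetSemanticSimilarity(Map<Long, Map<Long, Double>> semanticSims) {
        this.semanticSims = semanticSims;
    }

    @Override
    public double userSimilarity(long userID1, long userID2) {
        // Each user is an identifier of its dataset, so the dataset
        // semantic similarity of Eq. (8) serves directly as user similarity.
        Map<Long, Double> row = semanticSims.get(userID1);
        Double s = (row == null) ? null : row.get(userID2);
        return (s == null) ? 0.0 : s;
    }

    @Override
    public void setPreferenceInferrer(PreferenceInferrer inferrer) {
        // Not needed: similarities are precomputed from semantic features.
    }

    @Override
    public void refresh(Collection<Refreshable> alreadyRefreshed) {
        // Nothing to refresh; the similarity matrix is static.
    }
}
```

An analogous class can realize ItemSimilarity, since under our one-publisher-per-dataset assumption the same similarity matrix serves both roles.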

To the best of our knowledge, there is little prior work applying Recommender systems techniques to the problem of identifying target datasets for sameAs interlinking. For comparison, we have chosen three simple recommenders (Random, ItemAverage, and ItemUserAverage) and three original collaborative filtering recommenders (Item-based, User-based, and RatingSGD). The Random recommender produces random recommendations and preference estimates. The ItemAverage recommender always estimates the rating for an item to be the average of all known preference values for that item. The ItemUserAverage recommender is like the ItemAverage recommender, except that its estimated ratings are adjusted for the user's average rating value. The Item-based recommender is the original one implemented in Mahout; the Item-{Vocabulary, Class, Property} recommenders are our customized recommenders in which the similarity component of the original item-based algorithm is replaced by our dataset semantic similarity with vocabulary, class, and property features respectively. The same holds for the User-based recommenders. For User-based recommenders, there are two ways of choosing neighborhoods: fixed-size neighborhoods (noted with n as the neighborhood size parameter) and threshold-based neighborhoods (noted with t as the threshold parameter). We explored a range of choices of both parameters for both evaluations: for fixed-size neighborhoods, n ranges over [1,10] with a step size of 1; for threshold-based neighborhoods, t ranges over [0.1,0.9] with a step size of 0.1. The best results are shown in Tables 1 and 2, with the optimal parameters noted in the cells. Three parameters can be tuned for the RatingSGD recommender: f, the number of factors used to compute the factorization; \(\gamma \), the learning rate; and i, the number of iterations. These parameters were also tuned for optimum performance.

When evaluating rating accuracy, we vary the percentage of rating records used for training from 50 % to 90 %. Table 1 shows that the MAEs are around 2.5 for the Random recommender. With some simple intuitions, the MAEs drop to about 1.0 to 1.2 for the ItemUserAverage and ItemAverage recommenders. The original item-based recommender in Mahout lowers the MAEs further, to about 0.8. Our Item-{Vocabulary, Class, Property} recommenders perform better in MAE at training percentages of 50 %, 60 %, and 70 %, but worse at 80 % and 90 %. The MAEs of the original user-based recommender with fixed-size neighborhoods are in the range (0.9, 1.0); those with threshold-based neighborhoods are in the same range but slightly lower. Both User-based recommenders perform better than the Item-based recommender. The RatingSGD recommender is generally better than the original item-based and user-based recommenders, but not as good as our customized recommenders. The User-{Vocabulary, Class, Property} recommenders with fixed-size neighborhoods mostly perform better than the original User-based recommender with fixed-size neighborhoods. The User-{Vocabulary, Class, Property} recommenders with threshold-based neighborhoods perform much better than the original User-based recommender with threshold-based neighborhoods; in fact, the MAEs of the User-Class recommender with threshold-based neighborhoods are the lowest of all tested recommenders at all training percentages. The values fall within (0.3, 0.5), only half the MAEs achieved by the best original recommender, i.e., the Item-based recommender.

Table 1. MAE comparison of various recommenders

For evaluating top N recommendation, we evaluate the top 1 to top 10 recommendation performance of the various recommenders for a comprehensive comparison. Due to space limits, Table 2 shows the results only for top 1 to 5; the trends for the top 6 to 10 tests are similar. The Random recommender fails completely in this test, since randomly recommending a small number of datasets out of 990 can hardly hit the right answers. The other two simple recommenders also perform very poorly. The original item-based and user-based recommenders perform better: the Item-based recommender achieves F1-Measures above 0.1 for the top {3, 5, 6, 7, 8} test cases; the F1-Measures of the User-based recommender with fixed-size neighborhoods are higher, within the range (0.2, 0.5); and those of the User-based recommender with threshold-based neighborhoods are within the range (0.1, 0.2). The RatingSGD recommender performs worse than the original item-based and user-based recommenders, with F1-Measures always below 0.05. Our customized Item-{Vocabulary, Class, Property} recommenders perform comparably to the Item-based recommender, while the User-{Vocabulary, Class, Property} recommenders all perform much better than the original user-based ones. The User-Vocabulary recommender with fixed-size neighborhoods achieves the best F1-Measures in all top n test cases except top 1; its F1-Measures fall within the range (0.2, 0.5), almost twice as high as those of the best original recommender, i.e., the User-based recommender with fixed-size neighborhoods.

Table 2. F1-measure comparison for Top N Recommendation

4 Related Works

Identifying relevant datasets for interlinking is a novel research area. There are a few approaches developed specifically for this purpose, which can be categorized into two groups.

In the first category, the problem is tackled as a retrieval task: these methods try to retrieve datasets that can be interlinked with a given dataset. Nikolov et al. [9] proposed a method that depends on a third-party semantic web search service. They use the labels of randomly selected individuals from a dataset to query the search service and aggregate the results by dataset; their experiments covered three example datasets. Moreover, not all datasets have instances with labels: Ell et al. [10] show that only 38.2 % of non-information resources have a label. Lopes et al. [6] proposed a probabilistic approach based on Bayesian theory, defining rank score functions that exploit the vocabulary features of datasets and the known dataset links. Liu et al. [5] modeled the problem in an Information Retrieval manner and developed a ranking function called collaborative dataset similarity, which proved quite effective. Using a learning-to-rank algorithm to incorporate these ranking functions, they further improved performance and achieved the best MAP (Mean Average Precision) compared with previous works.

In the second category, the problem is tackled with link prediction approaches: graphs of datasets and the interlinking between them are constructed, and link prediction measures are used to rank potential dataset pairs. Lopes et al. [6] represented the data space as a directed graph and used Preferential Attachment and Resource Allocation to measure the likelihood that two datasets can be connected; a linear combination of the two scores is used to rank dataset pairs. When computing the Preferential Attachment score, however, they used the size of the similarity set of the source dataset instead of its out-degree, where the similarity set is defined as the set of datasets sharing vocabulary features with the source dataset. Mera et al. [7] developed a dataset interlinking recommendation tool called TRT. They implemented most of the state-of-the-art local and quasi-local similarity indices, but these indices are not combined in any way. They also developed a tool called TRTML [8], in which the interlinking of datasets is represented as an undirected graph, four link prediction measures are implemented, and three supervised classification algorithms are used. They balanced the percentage of unlinked tripleset pairs to obtain better performance when comparing the algorithms; as a result, the testing setting no longer reflects the real challenge, since the real-world distribution is extremely imbalanced.

5 Conclusion

The Web of Data is constantly growing; in order to be considered Linked Data, the datasets published on the web have to be interlinked with other datasets. Data linking between two given datasets is a time-consuming process, so techniques that recommend candidate datasets to the publisher would substantially reduce the need for exploratory search. As the ubiquitous owl:sameAs property is used to connect these datasets, we focus on this type of link and address the problem of identifying target datasets for sameAs interlinking when the publisher's dataset is already linked to a few datasets. This is a scenario in which Recommender systems techniques can be applied. We construct a user-item matrix whose rating values depend on the number of RDF link triples between datasets. We extract semantic features of datasets and define a dataset semantic similarity algorithm serving as the similarity component of memory-based recommenders. The experiments show that Recommender systems techniques are effective for this problem and that our customized recommenders perform better than the original collaborative filtering recommenders. For future work, we plan to exploit more advanced recommendation techniques and develop more effective features focusing on the topical aspects of datasets.