Abstract
In recent years, RDF datasets from almost every knowledge domain have been published in the Linking Open Data (LOD) cloud. The Linked Open Data guidelines establish the conditions that resources must satisfy to be included in the LOD cloud and to be connected to previously published data. Publishing and linking resources in the LOD cloud relies on: i) cleaning the data and transforming it into existing RDF formats, ii) storing the data in RDF storage systems, and iii) interlinking the data. Because data sources are heterogeneous, the generated RDF data may be ambiguous, and the links may be incomplete with respect to this data. Users of the Web of Data require linked data to meet high quality standards in order to develop applications that produce trustworthy results, but data in the LOD cloud has not been curated; thus, tools are needed to detect data quality problems. For example, researchers who study Life Sciences datasets to explain phenomena or identify anomalies demand that their findings reflect current discoveries rather than artifacts of poor data quality, such as incompleteness or redundancy. In this paper we propose LiQuate, a system that uses Bayesian networks to study the incompleteness of links, as well as ambiguities between labels and between links, in the LOD cloud; the system can be applied to any domain. Additionally, a probabilistic rule-based system is used to infer new links that associate equivalent resources and make it possible to resolve the ambiguities and incompleteness identified during the exploration of the Bayesian network. As a proof of concept, we applied LiQuate to existing Life Sciences linked datasets and detected ambiguities in the data that may compromise confidence in the results of applications such as link prediction or pattern discovery.
We illustrate a variety of identified problems and propose a set of enriched intra- and inter-links that may improve the quality of data items and links of specific datasets of the LOD cloud.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ruckhaus, E., Vidal, ME. (2013). LiQuate-Estimating the Quality of Links in the Linking Open Data Cloud. In: Lacroix, Z., Ruckhaus, E., Vidal, ME. (eds) Resource Discovery. RED 2012. Lecture Notes in Computer Science, vol 8194. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45263-5_4
DOI: https://doi.org/10.1007/978-3-642-45263-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45262-8
Online ISBN: 978-3-642-45263-5
eBook Packages: Computer Science, Computer Science (R0)