Abstract
Discovering cross-knowledge-base links is of central importance for many tasks across the Linked Data Web. So far, learning link specifications has been addressed by approaches that rely on standard similarity and distance measures, such as the Levenshtein distance for strings and the Euclidean distance for numeric values. While these approaches have been shown to perform well, the use of standard similarity measures still hampers their accuracy, as several link discovery tasks can only be solved sub-optimally when relying on standard measures. In this paper, we address this drawback by presenting a novel approach to learning string similarity measures concurrently across multiple dimensions directly from labeled data. Our approach is based on learning linear classifiers that rely on learned edit distances within an active learning setting. By combining these paradigms, we reduce the labeling burden on the experts while achieving superior results on datasets for which edit distances are useful. We evaluate our approach on three real datasets and show that it can improve the accuracy of classifiers. We also discuss how our approach can be extended to other similarity and distance measures as well as to different classifiers.
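To illustrate what a learned edit distance can look like in practice, the following is a minimal sketch of a weighted Levenshtein distance in which each edit operation has its own cost. It is not the implementation described in this paper: the function name `weighted_edit_distance`, the `costs` table, and the example weights are all assumptions made for illustration, and the step that would learn the weights from labeled data is omitted.

```python
from typing import Dict, Tuple

# Hypothetical cost table: (source_char, target_char) -> cost of that edit.
# The empty string "" stands for "no character", so ("", "g") is an
# insertion of "g" and ("k", "") is a deletion of "k".
CostTable = Dict[Tuple[str, str], float]


def weighted_edit_distance(s: str, t: str, costs: CostTable,
                           default: float = 1.0) -> float:
    """Edit distance where each operation may carry its own cost.

    Operations missing from `costs` fall back to `default`, so an empty
    table reproduces the standard Levenshtein distance.
    """
    def cost(a: str, b: str) -> float:
        if a == b:
            return 0.0  # matching characters are free
        return costs.get((a, b), default)

    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [0.0] * (len(t) + 1)
    for j in range(1, len(t) + 1):
        prev[j] = prev[j - 1] + cost("", t[j - 1])

    for i in range(1, len(s) + 1):
        cur = [prev[0] + cost(s[i - 1], "")] + [0.0] * len(t)
        for j in range(1, len(t) + 1):
            cur[j] = min(
                prev[j] + cost(s[i - 1], ""),            # delete s[i-1]
                cur[j - 1] + cost("", t[j - 1]),         # insert t[j-1]
                prev[j - 1] + cost(s[i - 1], t[j - 1]),  # substitute or match
            )
        prev = cur
    return prev[len(t)]


# With no custom weights this is the plain Levenshtein distance.
print(weighted_edit_distance("kitten", "sitting", {}))  # prints 3.0

# A made-up weight that treats the substitution k -> s as cheap.
cheap_ks = {("k", "s"): 0.1}
print(weighted_edit_distance("kitten", "sitting", cheap_ks))
```

In a setting like the one the abstract describes, such a distance would be computed per property pair and the resulting values fed to a linear classifier; how the costs are updated from labeled examples is left out of this sketch.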
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Soru, T., Ngonga Ngomo, AC. (2013). Active Learning of Domain-Specific Distances for Link Discovery. In: Takeda, H., Qu, Y., Mizoguchi, R., Kitamura, Y. (eds) Semantic Technology. JIST 2012. Lecture Notes in Computer Science, vol 7774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37996-3_7
DOI: https://doi.org/10.1007/978-3-642-37996-3_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37995-6
Online ISBN: 978-3-642-37996-3
eBook Packages: Computer Science (R0)