Abstract
Despite advances in machine learning technologies, a schema matching result between two database schemas (e.g., one derived from COMA++) is likely to be imprecise. In particular, numerous “possible mappings” between the schemas may be derived from the matching result. In this paper, we study problems related to managing possible mappings between two heterogeneous XML schemas. First, we study how to efficiently generate possible mappings for a given schema matching task. While this problem can be solved by existing algorithms, we show how to improve the performance of the solution by using a divide-and-conquer approach. Second, storing and querying a large set of possible mappings can incur high storage and evaluation overhead. For XML schemas, we observe that their possible mappings often exhibit a high degree of overlap. We hence propose a novel data structure, called the block tree, to capture the commonalities among possible mappings. The block tree represents the possible mappings in a compact manner and can be generated efficiently. Moreover, it facilitates the evaluation of a probabilistic twig query (PTQ), which returns the non-zero probability that a fragment of an XML document matches a given query. For users who are interested only in the answers with the k highest probabilities, we also propose the top-k PTQ and present an efficient solution for it. An extensive evaluation on real-world data sets shows that our approaches significantly improve the efficiency of generating, storing, and querying possible mappings.
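To make the idea of querying over possible mappings concrete, the following is a minimal sketch and not the block-tree algorithm of the paper. It assumes a common semantics for uncertain mappings in which the probability of an answer is the sum of the probabilities of the possible mappings that produce it. The names `ptq`, `top_k_ptq`, `MAPPINGS`, and `source_of_title`, as well as the toy element names and probabilities, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
import heapq


def ptq(mappings, evaluate):
    """Naive probabilistic query: evaluate the query under every mapping.

    mappings: list of (mapping, probability) pairs (assumed to sum to 1).
    evaluate: function returning the set of answers under one mapping.
    Returns a dict from answer to its accumulated probability.
    """
    prob = defaultdict(float)
    for mapping, p in mappings:
        for answer in evaluate(mapping):
            prob[answer] += p  # assumption: answer probability = sum over mappings
    return dict(prob)


def top_k_ptq(mappings, evaluate, k):
    """Return the k answers with the highest probabilities."""
    scores = ptq(mappings, evaluate)
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


# Hypothetical possible mappings: target element -> source element.
# The first two mappings overlap on "title", the kind of commonality
# that a compact representation could share instead of re-evaluating.
MAPPINGS = [
    ({"title": "book_name", "author": "writer"}, 0.5),
    ({"title": "book_name", "author": "editor"}, 0.25),
    ({"title": "heading", "author": "writer"}, 0.25),
]


def source_of_title(mapping):
    """Toy query: which source element corresponds to the target 'title'?"""
    return {mapping["title"]}


if __name__ == "__main__":
    print(ptq(MAPPINGS, source_of_title))
    print(top_k_ptq(MAPPINGS, source_of_title, 1))
```

This sketch evaluates the query once per mapping, which is exactly the overhead that grows with the number of possible mappings; sharing work across overlapping mappings is the motivation the abstract gives for the block tree.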
Cite this article
Gong, J., Cheng, R. & Cheung, D.W. Efficient management of uncertainty in XML schema matching. The VLDB Journal 21, 385–409 (2012). https://doi.org/10.1007/s00778-011-0248-4
DOI: https://doi.org/10.1007/s00778-011-0248-4