Abstract
XML has undoubtedly become a standard for data representation and manipulation. However, most XML documents are still created without a description of their structure, i.e. an XML schema. Hence, in this paper we focus on the problem of automatically inferring an XML schema from a given sample set of XML documents. In particular, we focus on new features of the XML Schema language and propose an algorithm that improves on a combination of verified approaches while remaining sufficiently general and open to further enhancement. Using a set of experiments, we illustrate the behavior of the algorithm on both real-world and artificial XML data.
This work was supported in part by the Czech Science Foundation (GAČR), grant number 201/06/0756.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vošta, O., Mlýnková, I., Pokorný, J. (2008). Even an Ant Can Create an XSD. In: Haritsa, J.R., Kotagiri, R., Pudi, V. (eds) Database Systems for Advanced Applications. DASFAA 2008. Lecture Notes in Computer Science, vol 4947. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78568-2_6
DOI: https://doi.org/10.1007/978-3-540-78568-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78567-5
Online ISBN: 978-3-540-78568-2
eBook Packages: Computer Science (R0)