Abstract
The induction of monadic node selecting queries from partially annotated XML-trees is a key task in Web information extraction. We show how to integrate schema guidance into an RPNI-based learning algorithm, in which monadic queries are represented by pruning node selecting tree transducers. We present experimental results on schema guidance by the DTD of HTML.
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Champavère, J., Gilleron, R., Lemay, A., Niehren, J. (2008). Schema-Guided Induction of Monadic Queries. In: Clark, A., Coste, F., Miclet, L. (eds) Grammatical Inference: Algorithms and Applications. ICGI 2008. Lecture Notes in Computer Science, vol 5278. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88009-7_2
DOI: https://doi.org/10.1007/978-3-540-88009-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88008-0
Online ISBN: 978-3-540-88009-7
eBook Packages: Computer Science (R0)