Interactive learning of node selecting tree transducer

Carme, Julien; Gilleron, Rémi; Lemay, Aurélien; Niehren, Joachim

doi:10.1007/s10994-006-9613-8

Interactive learning of node selecting tree transducer

Published: 16 August 2006

Volume 66, pages 33–67, (2007)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

Interactive learning of node selecting tree transducer

Download PDF

Julien Carme¹,
Rémi Gilleron¹,
Aurélien Lemay¹ &
…
Joachim Niehren¹

546 Accesses
27 Citations
Explore all metrics

Abstract

We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction.

We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particular class of tree automata that we introduce. We prove that deterministic NSTTs capture the class of queries definable in monadic second order logic (MSO) in trees, which Gottlob and Koch (2002) argue to have the right expressiveness for Web information extraction, and prove that monadic queries defined by NSTTs can be answered efficiently. We present a new polynomial time algorithm in RPNI-style that learns monadic queries defined by deterministic NSTTs from completely annotated examples, where all selected nodes are distinguished.

In practice, users prefer to provide partial annotations. We propose to account for partial annotations by intelligent tree pruning heuristics. We introduce pruning NSTTs—a formalism that shares many advantages of NSTTs. This leads us to an interactive learning algorithm for monadic queries defined by pruning NSTTs, which satisfies a new formal active learning model in the style of Angluin (1987).

We have implemented our interactive learning algorithm integrated it into a visually interactive Web information extraction system—called SQUIRREL—by plugging it into the Mozilla Web browser. Experiments on realistic Web documents confirm excellent quality with very few user interactions during wrapper induction.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75(2), 87–106.
Article MATH MathSciNet Google Scholar
Baumgartner, R., Flesca, S., & Gottlob, G. (2001). Visual web information extraction with lixto. In 28th International Conference on Very Large Data Bases, pp. 119–128.
Brüggemann-Klein, A., Wood, D., & Murata, M. (2001). Regular tree and regular hedge languages over unranked alphabets: Version 1.
Carme, J., Gilleron, R., Lemay, A., & Niehren, J. (2005). Interactive learning of node selecting tree transducer. In IJCAI Workshop on Grammatical Inference.
Carme, J., Lemay, A., & Niehren, J. (2004a). Learning node selecting tree transducer from completely annotated examples. In 7th International Colloquium on Grammatical Inference, Vol. 3264 of Lecture Notes in Artificial Intelligence, (pp. 91–102). Springer Verlag.
Carme, J., Niehren, J., & Tommasi, M. (2004b). Querying unranked trees with stepwise tree automata. In 19th International Conference on Rewriting Techniques and Applications, Vol. 3091 of Lecture Notes in Computer Science, (pp. 105–118). Springer Verlag.
Chidlovskii, B. (2001). Wrapping web information providers by transducer induction. In Proc. European Conference on Machine Learning, Vol. 2167 of Lecture Notes in Artificial Intelligence, pp. 61–73.
Cohen, W., Hurst, M., & Jensen, L. (2003). Web document analysis: challenges and opportunities, chap. A Flexible Learning System for Wrapping Tables and Lists in HTML Documents. World Scientific.
Comon, H., Dauchet, M., Gilleron, R., Jacquemard, F., Lugiez, D., Tison, S., & Tommasi, M. (1997). Tree automata techniques and applications. Available on: http://www.grappa.univ-lille3.fr/tata.
Cristau, J., Löding, C., & Thomas, W. (2005). Deterministic automata on unranked trees. In 15th International Symposium on Fundamentals of Computation Theory. To Appear.
de la Higuera, C. (1997). Characteristic sets for polynomial grammatical inference. Machine Learning, 27, 125–137.
Article MATH Google Scholar
Drewes, F., & Hogberg, J. (2003). Learning a regular tree language from a teacher. In D.L.T. 2003, Vol. 2710 of Lecture Notes in Computer Science, (pp. 279–291).
Freitag, D., & Kushmerick, N. (2000). Boosted wrapper induction. In AAAI/IAAI, (pp. 577–583).
Freitag, D., & McCallum, A. K. (1999). Information extraction with hmms and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction.
Frick, M., Grohe, M., & Koch, C. (2003). Query evaluation on compressed trees. In 18th IEEE Symposium on Logic in Computer Science, (pp. 188–197).
Gold, E. (1967). Language identification in the limit. Inform. Control, 10, 447–474.
Article MATH Google Scholar
Gold, E. (1978). Complexity of automaton identification from given data. Inform. Control, 37, 302–320.
Article MATH MathSciNet Google Scholar
Gottlob, G., & Koch, C. (2002). Monadic queries over tree-structured data. In 17th Annual IEEE Symposium on Logic in Computer Science, (pp. 189–202) Copenhagen.
Hsu, C.-N., & Dung, M.-T. (1998). Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8), 521–538.
Article Google Scholar
Kermorvant, C., & de la Higuera, C. (2002). Learning language with help. In 6th International Colloquium on Grammatical Inference, Vol. 2484 of Lecture Notes in Artificial Intelligence, (pp. 161–173). Springer Verlag.
Kosala, R., Bruynooghe, M., Van den Bussche, J., & Blockeel, H. (2003). Information extraction from web documents based on local unranked tree automaton inference. In 18th International Joint Conference on Artificial Intelligence, (pp. 403–408). Morgan Kaufmann.
Kushmerick, N. (1997). Wrapper Induction for Information Extraction. Ph.D. thesis, University of Washington.
Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1–2), 15–68.
Article MATH MathSciNet Google Scholar
Kushmerick, N. (2002). Finite-state approaches to web information extraction. In Proc. 3rd Summer Convention on Information Extraction.
Lang, K. J., Pearlmutter, B. A., & Price, R. A. (1998). Results of the abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. Lecture Notes in Computer Science, 1433, 1–12.
Lang, K. (1992). Random DFA’s can be approximately learned from sparse uniform examples. In Proc. 5th Annu. Workshop on Comput. Learning Theory, (pp. 45–52). ACM Press, New York, NY.
Libkin, L. (2005). Logics over unranked trees: An overview. In Automata, Languages and Programming: 32nd International Colloquium, No. 3580 in Lecture Notes in Computer Science, (pp. 35–50). Springer Verlag.
Martens, W., & Niehren, J. (2006). On the minimization of XML schemas and tree automata for unranked trees. Journal of Computer and System Science. Special issue of DBPL’05.
Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2), 93–114.
Article Google Scholar
Neumann, A., & Seidl, H. (1998). Locating matches of tree patterns in forests. In Foundations of Software Technology and Theoretical Computer Science, (pp. 134–145).
Neven, F., & Schwentick, T. (2002). Query automata over finite trees. Theoretical Computer Science, 275(1–2), 633–674.
Article MATH MathSciNet Google Scholar
Neven, F., & Van Den Bussche, J. (2002). Expressiveness of structured document query languages based on attribute grammars. Journal of the ACM, 49(1), 56–100.
Article MathSciNet Google Scholar
Niehren, J., Planque, L., Talbot, J.-M., & Tison, S. (2005). N-ary queries by tree automata. In 10th International Symposium on Database Programming Languages, Vol. 3774 of Lecture Notes in Computer Science, (pp. 217–231). Springer Verlag.
Oncina, J., & García, P. (1992). Inferring regular languages in polynomial update time. In Pattern Recognition and Image Analysis, (pp. 49–61).
Oncina, J., & García, P. (1993). Inference of recognizable tree sets. Tech. rep., Departamento de Sistemas Informáticos y Computación, Universidad de Alicante. DSIC-II/47/93.
Raeymaekers, S., & Bruynooghe, M. (2004). Minimization of finite unranked tree automata. Manuscript.
Raeymaekers, S., Bruynooghe, M., & Van den Bussche, J. (2005). Learning (k,l)-contextual tree languages for information extraction. In Proceedings of ECML’2005, Vol. 3720 of Lecture Notes in Artificial Intelligence, (pp. 305–316).
Sakakibara, Y. (1990). Learning context-free grammars from structural data in polynomial time. Theoretical Computer Science, 76, 223–242.
Article MATH MathSciNet Google Scholar
Seidl, H. (1989). On the finite degree of ambiguity of finite tree automata. Acta Informatica, 26(6), 527–542.
Article MATH MathSciNet Google Scholar
Thatcher, J. W. (1967). Characterizing derivation trees of context-free grammars through a generalization of automata theory. Journal of Computer and System Science, 1, 317–322.
MATH MathSciNet Google Scholar
Thatcher, J. W., & Wright, J. B. (1968). Generalized finite automata with an application to a decision problem of second-order logic. Mathematical System Theory, 2, 57–82.
Article MATH MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Mostrare project, INRIA Futurs and Lille University, LIFL, Lille
Julien Carme, Rémi Gilleron, Aurélien Lemay & Joachim Niehren

Authors

Julien Carme
View author publications
You can also search for this author in PubMed Google Scholar
Rémi Gilleron
View author publications
You can also search for this author in PubMed Google Scholar
Aurélien Lemay
View author publications
You can also search for this author in PubMed Google Scholar
Joachim Niehren
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Aurélien Lemay.

Additional information

Editor: Georgios Paliouras and Yasubumi Sakakibara

Rights and permissions

Reprints and permissions

About this article

Cite this article

Carme, J., Gilleron, R., Lemay, A. et al. Interactive learning of node selecting tree transducer. Mach Learn 66, 33–67 (2007). https://doi.org/10.1007/s10994-006-9613-8

Download citation

Received: 04 July 2005
Revised: 12 May 2006
Accepted: 22 June 2006
Published: 16 August 2006
Issue Date: January 2007
DOI: https://doi.org/10.1007/s10994-006-9613-8

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Interactive learning of node selecting tree transducer

Abstract

Article PDF

Similar content being viewed by others

Browsing the Parse Space

Learning Recursive Probability Trees from Data

A System for Probabilistic Inductive Answer Set Programming

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Interactive learning of node selecting tree transducer

Abstract

Article PDF

Similar content being viewed by others

Browsing the Parse Space

Learning Recursive Probability Trees from Data

A System for Probabilistic Inductive Answer Set Programming

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation