Abstract
We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction.
We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particular class of tree automata that we introduce. We prove that deterministic NSTTs capture the class of queries definable in monadic second order logic (MSO) in trees, which Gottlob and Koch (2002) argue to have the right expressiveness for Web information extraction, and prove that monadic queries defined by NSTTs can be answered efficiently. We present a new polynomial time algorithm in RPNI-style that learns monadic queries defined by deterministic NSTTs from completely annotated examples, where all selected nodes are distinguished.
In practice, users prefer to provide partial annotations. We propose to account for partial annotations by intelligent tree pruning heuristics. We introduce pruning NSTTs—a formalism that shares many advantages of NSTTs. This leads us to an interactive learning algorithm for monadic queries defined by pruning NSTTs, which satisfies a new formal active learning model in the style of Angluin (1987).
We have implemented our interactive learning algorithm integrated it into a visually interactive Web information extraction system—called SQUIRREL—by plugging it into the Mozilla Web browser. Experiments on realistic Web documents confirm excellent quality with very few user interactions during wrapper induction.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75(2), 87–106.
Baumgartner, R., Flesca, S., & Gottlob, G. (2001). Visual web information extraction with lixto. In 28th International Conference on Very Large Data Bases, pp. 119–128.
Brüggemann-Klein, A., Wood, D., & Murata, M. (2001). Regular tree and regular hedge languages over unranked alphabets: Version 1.
Carme, J., Gilleron, R., Lemay, A., & Niehren, J. (2005). Interactive learning of node selecting tree transducer. In IJCAI Workshop on Grammatical Inference.
Carme, J., Lemay, A., & Niehren, J. (2004a). Learning node selecting tree transducer from completely annotated examples. In 7th International Colloquium on Grammatical Inference, Vol. 3264 of Lecture Notes in Artificial Intelligence, (pp. 91–102). Springer Verlag.
Carme, J., Niehren, J., & Tommasi, M. (2004b). Querying unranked trees with stepwise tree automata. In 19th International Conference on Rewriting Techniques and Applications, Vol. 3091 of Lecture Notes in Computer Science, (pp. 105–118). Springer Verlag.
Chidlovskii, B. (2001). Wrapping web information providers by transducer induction. In Proc. European Conference on Machine Learning, Vol. 2167 of Lecture Notes in Artificial Intelligence, pp. 61–73.
Cohen, W., Hurst, M., & Jensen, L. (2003). Web document analysis: challenges and opportunities, chap. A Flexible Learning System for Wrapping Tables and Lists in HTML Documents. World Scientific.
Comon, H., Dauchet, M., Gilleron, R., Jacquemard, F., Lugiez, D., Tison, S., & Tommasi, M. (1997). Tree automata techniques and applications. Available on: http://www.grappa.univ-lille3.fr/tata.
Cristau, J., Löding, C., & Thomas, W. (2005). Deterministic automata on unranked trees. In 15th International Symposium on Fundamentals of Computation Theory. To Appear.
de la Higuera, C. (1997). Characteristic sets for polynomial grammatical inference. Machine Learning, 27, 125–137.
Drewes, F., & Hogberg, J. (2003). Learning a regular tree language from a teacher. In D.L.T. 2003, Vol. 2710 of Lecture Notes in Computer Science, (pp. 279–291).
Freitag, D., & Kushmerick, N. (2000). Boosted wrapper induction. In AAAI/IAAI, (pp. 577–583).
Freitag, D., & McCallum, A. K. (1999). Information extraction with hmms and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction.
Frick, M., Grohe, M., & Koch, C. (2003). Query evaluation on compressed trees. In 18th IEEE Symposium on Logic in Computer Science, (pp. 188–197).
Gold, E. (1967). Language identification in the limit. Inform. Control, 10, 447–474.
Gold, E. (1978). Complexity of automaton identification from given data. Inform. Control, 37, 302–320.
Gottlob, G., & Koch, C. (2002). Monadic queries over tree-structured data. In 17th Annual IEEE Symposium on Logic in Computer Science, (pp. 189–202) Copenhagen.
Hsu, C.-N., & Dung, M.-T. (1998). Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8), 521–538.
Kermorvant, C., & de la Higuera, C. (2002). Learning language with help. In 6th International Colloquium on Grammatical Inference, Vol. 2484 of Lecture Notes in Artificial Intelligence, (pp. 161–173). Springer Verlag.
Kosala, R., Bruynooghe, M., Van den Bussche, J., & Blockeel, H. (2003). Information extraction from web documents based on local unranked tree automaton inference. In 18th International Joint Conference on Artificial Intelligence, (pp. 403–408). Morgan Kaufmann.
Kushmerick, N. (1997). Wrapper Induction for Information Extraction. Ph.D. thesis, University of Washington.
Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1–2), 15–68.
Kushmerick, N. (2002). Finite-state approaches to web information extraction. In Proc. 3rd Summer Convention on Information Extraction.
Lang, K. J., Pearlmutter, B. A., & Price, R. A. (1998). Results of the abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. Lecture Notes in Computer Science, 1433, 1–12.
Lang, K. (1992). Random DFA’s can be approximately learned from sparse uniform examples. In Proc. 5th Annu. Workshop on Comput. Learning Theory, (pp. 45–52). ACM Press, New York, NY.
Libkin, L. (2005). Logics over unranked trees: An overview. In Automata, Languages and Programming: 32nd International Colloquium, No. 3580 in Lecture Notes in Computer Science, (pp. 35–50). Springer Verlag.
Martens, W., & Niehren, J. (2006). On the minimization of XML schemas and tree automata for unranked trees. Journal of Computer and System Science. Special issue of DBPL’05.
Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2), 93–114.
Neumann, A., & Seidl, H. (1998). Locating matches of tree patterns in forests. In Foundations of Software Technology and Theoretical Computer Science, (pp. 134–145).
Neven, F., & Schwentick, T. (2002). Query automata over finite trees. Theoretical Computer Science, 275(1–2), 633–674.
Neven, F., & Van Den Bussche, J. (2002). Expressiveness of structured document query languages based on attribute grammars. Journal of the ACM, 49(1), 56–100.
Niehren, J., Planque, L., Talbot, J.-M., & Tison, S. (2005). N-ary queries by tree automata. In 10th International Symposium on Database Programming Languages, Vol. 3774 of Lecture Notes in Computer Science, (pp. 217–231). Springer Verlag.
Oncina, J., & García, P. (1992). Inferring regular languages in polynomial update time. In Pattern Recognition and Image Analysis, (pp. 49–61).
Oncina, J., & García, P. (1993). Inference of recognizable tree sets. Tech. rep., Departamento de Sistemas Informáticos y Computación, Universidad de Alicante. DSIC-II/47/93.
Raeymaekers, S., & Bruynooghe, M. (2004). Minimization of finite unranked tree automata. Manuscript.
Raeymaekers, S., Bruynooghe, M., & Van den Bussche, J. (2005). Learning (k,l)-contextual tree languages for information extraction. In Proceedings of ECML’2005, Vol. 3720 of Lecture Notes in Artificial Intelligence, (pp. 305–316).
Sakakibara, Y. (1990). Learning context-free grammars from structural data in polynomial time. Theoretical Computer Science, 76, 223–242.
Seidl, H. (1989). On the finite degree of ambiguity of finite tree automata. Acta Informatica, 26(6), 527–542.
Thatcher, J. W. (1967). Characterizing derivation trees of context-free grammars through a generalization of automata theory. Journal of Computer and System Science, 1, 317–322.
Thatcher, J. W., & Wright, J. B. (1968). Generalized finite automata with an application to a decision problem of second-order logic. Mathematical System Theory, 2, 57–82.
Author information
Authors and Affiliations
Corresponding author
Additional information
Editor: Georgios Paliouras and Yasubumi Sakakibara
Rights and permissions
About this article
Cite this article
Carme, J., Gilleron, R., Lemay, A. et al. Interactive learning of node selecting tree transducer. Mach Learn 66, 33–67 (2007). https://doi.org/10.1007/s10994-006-9613-8
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-006-9613-8