Abstract
PADS is a declarative language for describing the syntax and semantic properties of ad hoc data sources such as financial transactions, server logs, and scientific data sets. The PADS compiler reads these descriptions and generates a suite of data processing tools, including format translators, parsers, printers, and even a query engine, all customized to the ad hoc data format in question. Recently, to further improve the productivity of programmers who manage ad hoc data sources, we have turned to using PADS as an intermediate language in a system that first infers a PADS description directly from example data and then passes that description to the original compiler for tool generation. A key subproblem in the inference engine is the token ambiguity problem: determining which substrings in the example data correspond to complex tokens such as dates, URLs, or comments. To solve the token ambiguity problem, the paper studies the relative effectiveness of three different statistical models for tokenizing ad hoc data. It also shows how to incorporate these models into a general and effective format inference algorithm. In addition to using a declarative language (PADS) as a key intermediate form, we have implemented the system as a whole in ML.
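To make the token ambiguity problem concrete, here is a minimal Python sketch that enumerates every way a string can be split into tokens. It is an illustration under stated assumptions, not the paper's method: the token names and regular expressions are hypothetical, the system described above is written in ML, and the sketch only exposes the ambiguity rather than resolving it with a statistical model.

```python
import re

# Hypothetical token definitions, for illustration only.
# These are NOT the token set used by PADS or by the inference system.
TOKENS = {
    "date": re.compile(r"\d{1,2}/\d{1,2}/\d{4}"),
    "int": re.compile(r"\d+"),
    "punct": re.compile(r"[/:.\-]"),
}


def tokenizations(s, start=0):
    """Yield every split of s[start:] into a list of (token_name, text) pairs.

    Each token type contributes at most one candidate per position:
    the match its regex finds when anchored at that position.
    """
    if start == len(s):
        yield []
        return
    for name, pattern in TOKENS.items():
        m = pattern.match(s, start)
        if m and m.end() > start:
            for rest in tokenizations(s, m.end()):
                yield [(name, m.group())] + rest


if __name__ == "__main__":
    # The same substring can be read as one complex token or as several simple ones.
    for candidate in tokenizations("12/25/2008"):
        print(candidate)
```

Under these assumed definitions, the input `12/25/2008` has two readings: a single `date` token, or the sequence `int`, `punct`, `int`, `punct`, `int`. Choosing between such readings is the decision the statistical tokenization models are meant to make.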
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Xi, Q., Fisher, K., Walker, D., Zhu, K.Q. (2008). Ad Hoc Data and the Token Ambiguity Problem. In: Gill, A., Swift, T. (eds) Practical Aspects of Declarative Languages. PADL 2009. Lecture Notes in Computer Science, vol 5418. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-92995-6_7
DOI: https://doi.org/10.1007/978-3-540-92995-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-92994-9
Online ISBN: 978-3-540-92995-6