Ad Hoc Data and the Token Ambiguity Problem

Xi, Qian; Fisher, Kathleen; Walker, David; Zhu, Kenny Q.

doi:10.1007/978-3-540-92995-6_7


Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5418)

Included in the following conference series:

International Symposium on Practical Aspects of Declarative Languages


Abstract

PADS is a declarative language used to describe the syntax and semantic properties of ad hoc data sources such as financial transactions, server logs and scientific data sets. The PADS compiler reads these descriptions and generates a suite of useful data processing tools such as format translators, parsers, printers and even a query engine, all customized to the ad hoc data format in question. Recently, however, to further improve the productivity of programmers who manage ad hoc data sources, we have turned to using PADS as an intermediate language in a system that first infers a PADS description directly from example data and then passes that description to the original compiler for tool generation. A key subproblem in the inference engine is the token ambiguity problem: determining which substrings in the example data correspond to complex tokens such as dates, URLs, or comments. To solve the token ambiguity problem, the paper studies the relative effectiveness of three different statistical models for tokenizing ad hoc data. It also shows how to incorporate these models into a general and effective format inference algorithm. In addition to using a declarative language (PADS) as a key intermediate form, we have implemented the system as a whole in ML.
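To make the token ambiguity problem concrete, the following is a minimal sketch in Python, not the authors' ML implementation. The token names and regular expressions are hypothetical and chosen only for illustration; they are not the PADS token set. The sketch enumerates every way a short string can be split into tokens, which shows why some model is needed to choose among competing readings.

```python
import re

# Hypothetical token definitions for illustration only (not the PADS token set).
TOKENS = {
    "int": re.compile(r"[0-9]+"),
    "float": re.compile(r"[0-9]+\.[0-9]+"),
    "date": re.compile(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}"),
    "punct": re.compile(r"[/.]"),
}


def tokenizations(s, pos=0):
    """Return every way to split s[pos:] into a list of (name, text) tokens.

    Each pattern contributes at most one match per position (its greedy match),
    so the ambiguity shown here comes only from different token types
    competing for the same substring.
    """
    if pos == len(s):
        return [[]]
    results = []
    for name, pattern in TOKENS.items():
        m = pattern.match(s, pos)
        if m:
            for rest in tokenizations(s, m.end()):
                results.append([(name, m.group())] + rest)
    return results


if __name__ == "__main__":
    for candidate in tokenizations("12/25/2008"):
        print(candidate)
```

Under these assumed definitions, the string `12/25/2008` has two readings: a single `date` token, or three `int` tokens separated by `punct` tokens. The sketch only enumerates the candidates; ranking them is the job of the statistical models the abstract refers to, which are not reproduced here.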



Author information

Authors and Affiliations

Princeton University, USA
Qian Xi, David Walker & Kenny Q. Zhu
AT&T Research, USA
Kathleen Fisher


Editor information

Editors and Affiliations

University of Kansas, 2001 Eaton Hall, 1520 W. 15th St., KS 66045-7621, Lawrence, USA
Andy Gill
Universidade Nova de Lisboa, CENTRIA, PO Box 325, WV 25443, Shepherdstown, USA
Terrance Swift


About this paper

Cite this paper

Xi, Q., Fisher, K., Walker, D., Zhu, K.Q. (2008). Ad Hoc Data and the Token Ambiguity Problem. In: Gill, A., Swift, T. (eds) Practical Aspects of Declarative Languages. PADL 2009. Lecture Notes in Computer Science, vol 5418. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-92995-6_7


DOI: https://doi.org/10.1007/978-3-540-92995-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-92994-9
Online ISBN: 978-3-540-92995-6
eBook Packages: Computer Science (R0)
