Abstract
We study the problem of streaming regular expression parsing: Given a regular expression and an input stream of symbols, how to output a serialized syntax tree representation as an output stream during input stream processing.
We show that optimally streaming regular expression parsing, outputting bits of the output as early as is semantically possible for any regular expression of size m and any input string of length n, can be performed in time O(2m logm + m n) on a unit-cost random-access machine. This is for the wide-spread greedy disambiguation strategy for choosing parse trees of grammatically ambiguous regular expressions. In particular, for a fixed regular expression, the algorithm’s run-time scales linearly with the input string length. The exponential is due to the need for preprocessing the regular expression to analyze state coverage of its associated NFA, a PSPACE-hard problem, and tabulating all reachable ordered sets of NFA-states.
Previous regular expression parsing algorithms operate in multiple phases, always requiring processing or storing the whole input string before outputting the first bit of output, not only for those regular expressions and input prefixes where reading to the end of the input is strictly necessary.
This work has been partially supported by The Danish Council for Independent Research under Project 11-106278, “Kleene Meets Church: Regular Expressions and Types”. The order of authors is insignificant.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Brüggemann-Klein, A., Wood, D.: One-unambiguous regular languages. Information and computation 140(2), 229–253 (1998), http://dx.doi.org/10.1006/inco.1997.2688
Debarbieux, D., Gauwin, O., Niehren, J., Sebastian, T., Zergaoui, M.: Early nested word automata for XPath query answering on XML streams. In: Konstantinidis, S. (ed.) CIAA 2013. LNCS, vol. 7982, pp. 292–305. Springer, Heidelberg (2013)
Dubé, D., Feeley, M.: Efficiently building a parse tree from a regular expression. Acta Informatica 37(2), 121–144 (2000), http://dx.doi.org/10.1007/s002360000037
Fowler, G.: An interpretation of the POSIX regex standard (January 2003), http://www2.research.att.com/~astopen/testregex/re-interpretation.html
Frisch, A., Cardelli, L.: Greedy regular expression matching. In: Díaz, J., Karhumäki, J., Lepistö, A., Sannella, D. (eds.) ICALP 2004. LNCS, vol. 3142, pp. 618–629. Springer, Heidelberg (2004), http://dx.doi.org/10.1007/978-3-540-27836-8_53
Grathwohl, N.B.B., Henglein, F., Nielsen, L., Rasmussen, U.T.: Two-pass greedy regular expression parsing. In: Konstantinidis, S. (ed.) CIAA 2013. LNCS, vol. 7982, pp. 60–71. Springer, Heidelberg (2013), http://dx.doi.org/10.1007/978-3-642-39274-0_7
Gupta, A.K., Suciu, D.: Stream processing of XPath queries with predicates. In: Proc. 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD 2003, pp. 419–430. ACM, New York (2003), http://dx.doi.org/10.1145/872757.872809
IEEE Computer Society: Standard for Information Technology - Portable Operating System Interface (POSIX), Base Specifications, Issue 7. IEEE (2008), http://dx.doi.org/10.1109/IEEESTD.2008.4694976 , IEEE Std 1003.1
Kearns, S.: Extending regular expressions with context operators and parse extraction. Software - Practice and Experience 21(8), 787–804 (1991), http://dx.doi.org/10.1002/spe.4380210803
Nielsen, L., Henglein, F.: Bit-coded regular expression parsing. In: Dediu, A.-H., Inenaga, S., Martín-Vide, C. (eds.) LATA 2011. LNCS, vol. 6638, pp. 402–413. Springer, Heidelberg (2011), http://dx.doi.org/10.1007/978-3-642-21254-3_32
Okui, S., Suzuki, T.: Disambiguation in regular expression matching via position automata with augmented transitions. In: Domaratzki, M., Salomaa, K. (eds.) CIAA 2010. LNCS, vol. 6482, pp. 231–240. Springer, Heidelberg (2011), http://dx.doi.org/10.1007/978-3-642-18098-9_25
Stockmeyer, L.J., Meyer, A.R.: Word problems requiring exponential time (preliminary report). In: Proc. Fifth Annual ACM Symposium on Theory of Computing, pp. 1–9. ACM (1973)
Sulzmann, M., Lu, K.Z.M.: Regular expression sub-matching using partial derivatives. In: Proc. 14th Symposium on Principles and Practice of Declarative Programming, PPDP 2012, pp. 79–90. ACM, New York (2012), http://dx.doi.org/10.1145/2370776.2370788
Sulzmann, M., Lu, K.Z.M.: POSIX regular expression parsing with derivatives. In: Codish, M., Sumii, E. (eds.) FLOPS 2014. LNCS, vol. 8475, pp. 203–220. Springer, Heidelberg (2014), http://dx.doi.org/10.1007/978-3-319-07151-0_13
Thompson, K.: Programming techniques: Regular expression search algorithm. Commun. ACM 11(6), 419–422 (1968), http://dx.doi.org/10.1145/363347.363387
Wall, L., Christiansen, T., Orwant, J.: Programming Perl. O’Reilly Media, Incorporated (2000)
Wu, X., Theodoratos, D.: A survey on XML streaming evaluation techniques. The VLDB Journal 22(2), 177–202 (2013), http://dx.doi.org/10.1007/s00778-012-0281-y
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Grathwohl, N.B.B., Henglein, F., Rasmussen, U.T. (2014). Optimally Streaming Greedy Regular Expression Parsing. In: Ciobanu, G., Méry, D. (eds) Theoretical Aspects of Computing – ICTAC 2014. ICTAC 2014. Lecture Notes in Computer Science, vol 8687. Springer, Cham. https://doi.org/10.1007/978-3-319-10882-7_14
Download citation
DOI: https://doi.org/10.1007/978-3-319-10882-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10881-0
Online ISBN: 978-3-319-10882-7
eBook Packages: Computer ScienceComputer Science (R0)