
1 Introduction

Grammars based on hyperedge replacement (HR) generate a well-studied class of context-free graph languages [16]. However, their generative power is too weak; e.g., their languages are known to have bounded treewidth [16, Thm. IV.3.12(7)]. Since this even excludes a language as simple as that of all graphs, HR grammars cannot reasonably be advocated for specifying graph models in general.

An example illustrating this weakness of hyperedge replacement is provided by “unstructured” flowcharts with jumps (see Sect. 6). Since jumps can target any location in the program, an edge that represents such a jump may point to an arbitrary node (representing a program location). Inserting such edges is beyond what hyperedge replacement can do, because it would require nonterminal hyperedges of unbounded arity, as provided, e.g., by the adaptive star grammars of [8].

A similar example is Abstract Meaning Representation [1], a representation of the meaning of natural language sentences that is being heavily studied in computational linguistics. Coreferences caused by, e.g., pronouns that refer to entities mentioned elsewhere in a sentence, give rise to edges that point to nodes which may be almost anywhere else in the graph. Hence, again, hyperedge replacement is too weak.

Contextual hyperedge replacement (CHR) has been devised as a moderate extension of HR that overcomes such restrictions, while preserving many other formal properties of HR [7, 9]. Rather than having left-hand sides consisting of nonterminal edges only, CHR rules can have additional isolated context nodes in their left-hand side, to which the right-hand side can attach edges. Hence, CHR can attach edges to already generated nodes elsewhere in the graph, but the gain in power is limited as the mechanism lacks control over which nodes to choose. (The application conditions of [17] are not yet supported by Grappa.)

In recent work, we have devised a very efficient predictive shift-reduce (PSR) parsing algorithm for a subclass of HR grammars. In this paper, we extend this algorithm to contextual HR grammars. Its implementation in the graph-parser distiller Grappa (footnote 1) turned out to be smooth, and yields parsers that are as efficient as those for the context-free case. This perhaps surprisingly good result is due to the fact that both parsers consume one edge after another and apply rules backwards until the start symbol is reached. As in the context-free case, the grammar analysis by the distiller ensures that suitable edges can be chosen in constant time, and backtracking is avoided. Hence, the overall running time of the generated parser remains linear.

The rest of this paper is structured as follows. Section 2 introduces CHR grammars. In Sect. 3 we recall PSR parsing and discuss where it has to be modified for CHR grammars. A particular normal form needed to parse CHR grammars is introduced in Sect. 4, before we discuss the analysis of lookahead in Sect. 5. In Sect. 6 we discuss a more realistic example grammar (of flowcharts), compare the efficiency of different parsers for this grammar, and evaluate some CHR grammars w.r.t. PSR-parsability with Grappa. We conclude by summarizing the results obtained so far and indicating related and future work in Sect. 7. Due to lack of space, our presentation is driven by a small artificial example, and properties like unique start nodes and free edge choice are not discussed here in order to keep the paper focused. The complete constructions and proofs for the base case, PSR parsing for HR grammars, can be found in [13].

2 Contextual Hyperedge Replacement Grammars

We first compile the basic notions and notation used in this paper. Throughout the paper, \(\mathbb {N}\) denotes the non-negative integers and \(A^*\) denotes the set of all finite sequences over a set A, with \(\varepsilon \) denoting the empty sequence.

We let X be a global, countably infinite supply of nodes or vertices.

Definition 1

(Graph). An alphabet is a set \(\varSigma \) of symbols together with an arity function \( arity :\varSigma \rightarrow \mathbb {N}\). Then a literal \(\varvec{e} = \mathsf {a}^{x_1 \cdots x_k}\) over \(\varSigma \) consists of a symbol \(\mathsf {a} \in \varSigma \) and an attachment \(x_1 \cdots x_k\) of \(k= arity (\mathsf {a})\) pairwise distinct nodes \(x_1,\dots ,x_k\in X\). We denote the set of all literals over \(\varSigma \) by \( Lit _{\varSigma }\).

A graph \(\gamma = \langle {V}, { \varphi }\rangle \) over \(\varSigma \) consists of a finite set \(V \subseteq X\) of nodes and a sequence \(\varphi = \varvec{e}_1 \cdots \varvec{e}_n\in Lit _{\varSigma }^*\) such that all nodes in these literals are in V. \(\mathcal {G}_{\varSigma }\) denotes the set of all graphs over \({\varSigma }\).

We say that two graphs \(\gamma = \langle {V}, { \varphi }\rangle \) and \(\gamma ' = \langle {V'}, { \varphi '}\rangle \) are equivalent, written \(\gamma \bowtie \gamma '\), if \(V = V'\) and \(\varphi \) is a permutation of \(\varphi '\).

Note that graphs are sequences of literals, i.e., two graphs \(\langle {V}, { \varphi }\rangle \) and \(\langle {V'}, { \varphi '}\rangle \) are considered to differ even if \(V=V'\) and \(\varphi '\) is just a permutation of \(\varphi \). However, such graphs are considered equivalent, denoted by the equivalence relation \(\bowtie \). “Ordinary” graphs would rather be represented using multisets of literals. The equivalence classes of graphs, therefore, correspond to conventional graphs. The ordering of literals is technically convenient for the constructions in this paper. However, input graphs to be parsed should of course be considered up to equivalence. To make sure that this is the case, our parsers always treat the remaining (not yet consumed) edge literals as a multiset rather than a sequence.

An injective function \(\varrho :X \rightarrow X\) is called a renaming. Moreover, \(\gamma ^\varrho \) denotes the graph obtained by renaming all nodes in \(\gamma \) according to \(\varrho \).

For a graph \(\gamma = \langle {V}, { \varphi }\rangle \), we use the notations \(X(\gamma ) = V\) and \( lit (\gamma )= \varphi \). We define the concatenation of two graphs \(\alpha , \beta \in \mathcal {G}_{\varSigma }\) as \(\alpha \beta =\langle X(\alpha ) \cup X(\beta ), lit (\alpha )\, lit (\beta )\rangle \). If a graph \(\gamma \) is completely determined by its sequence \( lit (\gamma )\) of literals, i.e., if each node in \(X(\gamma )\) also occurs in some literal in \( lit (\gamma )\), we simply use \( lit (\gamma )\) as a shorthand for \(\gamma \). In particular, a literal \(\varvec{e} \in Lit _{\varSigma }\) is identified with the graph consisting of just this literal and its attached nodes.
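To make this representation concrete, the following minimal sketch (ours, in Python; not Grappa's actual data structures) models graphs as node sets paired with literal sequences, with the equivalence \(\bowtie \) realized as a multiset comparison and concatenation as defined above:

```python
# A minimal sketch (ours, not Grappa's data structures) of Definition 1.
from collections import Counter
from typing import NamedTuple, Tuple, FrozenSet

class Literal(NamedTuple):
    symbol: str                    # label from the alphabet Sigma
    attachment: Tuple[str, ...]    # pairwise distinct nodes from X

class Graph(NamedTuple):
    nodes: FrozenSet[str]          # the finite node set V
    literals: Tuple[Literal, ...]  # the literal sequence phi

def equivalent(g1: Graph, g2: Graph) -> bool:
    """g1 ⋈ g2: equal node sets, literal sequences equal up to permutation."""
    return g1.nodes == g2.nodes and Counter(g1.literals) == Counter(g2.literals)

def concat(a: Graph, b: Graph) -> Graph:
    """The concatenation αβ of two graphs."""
    return Graph(a.nodes | b.nodes, a.literals + b.literals)

def literal_graph(lit: Literal) -> Graph:
    """A literal identified with the graph it completely determines."""
    return Graph(frozenset(lit.attachment), (lit,))
```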

We now recall contextual hyperedge replacement from [7, 9]. To keep the technicalities simple, we omit node labels. Adding them does not pose any technical or implementational difficulties. Node labels are actually available in Grappa, but discussing them here would only complicate the exposition (footnote 2).

Definition 2

(CHR Grammar). Let the alphabet \(\varSigma \) be partitioned into disjoint subsets \(\mathcal {N}\) and \(\mathcal {T}\) of nonterminals and terminals, respectively. A contextual hyperedge replacement rule \(r=(\alpha \rightarrow \beta )\) (a rule for short) has a graph \(\alpha \in \mathcal {G}_{\varSigma }\) with a single literal \( lit (\alpha )= \varvec{A} \in Lit _{\mathcal {N}}\) as its left-hand side, and a graph \(\beta \in \mathcal {G}_{\varSigma }\) with \(X(\alpha ) \subseteq X(\beta )\) as its right-hand side. The nodes in \(X(\alpha ) \setminus X(\varvec{A})\) are called context nodes of r. A rule without context nodes is called context-free.

Consider a graph \(\gamma = \delta \varvec{A'} \delta ' \in \mathcal {G}_{\varSigma }\) and a rule r as above. A renaming \(\mu \) is a match (of r to \(\gamma \)) if \(\varvec{A}^\mu = \varvec{A'}\) and (footnote 3)

$$\begin{aligned} X(\gamma ) \cap X(\beta ^\mu ) \subseteq X(\alpha ^\mu ) \subseteq X(\gamma ). \end{aligned}$$
(1)

A match \(\mu \) of r derives \(\gamma \) to the graph \(\gamma ' = \delta \beta ^\mu \delta '\). This is denoted as \(\gamma \Rightarrow _{r,\mu } \gamma '\), or just as \(\gamma \Rightarrow _r \gamma '\). We write \(\gamma \Rightarrow _\mathcal {R}\gamma '\) for a set \(\mathcal {R}\) of rules if \(\gamma \Rightarrow _r \gamma '\) for some \(r\in \mathcal {R}\), and denote the reflexive-transitive closure of \(\Rightarrow _\mathcal {R}\) by \(\Rightarrow _\mathcal {R}^*\), as usual.

A contextual hyperedge replacement grammar \(\varGamma = (\varSigma ,\mathcal {T},\mathcal {R},\mathsf {Z})\) (CHR grammar for short) consists of finite alphabets \(\varSigma ,\mathcal {T}\) as above, a finite set \(\mathcal {R}\) of rules over \(\varSigma \), and a start symbol \(\mathsf {Z}\in \mathcal {N}\) of arity 0. \(\varGamma \) generates the language \({\mathcal {L}}(\varGamma ) = \{ g \in \mathcal {G}_{\mathcal {T}} \mid \mathsf {Z}\mathrel {\mathop \Rightarrow \nolimits _\mathcal {R}^*} g \}\) of terminal graphs. We call a graph g valid with respect to \(\varGamma \) if \({\mathcal {L}}(\varGamma )\) contains a graph \(g'\) with \(g \bowtie g'\).
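Condition (1) has a direct operational reading: every node of the renamed right-hand side that already occurs in \(\gamma \) must stem from the renamed left-hand side (the attachment of the nonterminal or a context node), and the left-hand side may only use nodes of \(\gamma \). A small sketch of this check (our illustration; renamings are represented as dictionaries assumed to be injective):

```python
def rename(nodes, mu):
    """Apply the renaming mu to a set of nodes."""
    return frozenset(mu.get(x, x) for x in nodes)

def satisfies_condition_1(gamma_nodes, alpha_nodes, beta_nodes, mu):
    """Condition (1): X(gamma) ∩ X(beta^mu) ⊆ X(alpha^mu) ⊆ X(gamma)."""
    alpha_mu, beta_mu = rename(alpha_nodes, mu), rename(beta_nodes, mu)
    return (gamma_nodes & beta_mu) <= alpha_mu <= gamma_nodes
```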

Context-free rules are in fact hyperedge replacement (HR) rules as defined in [13, Def. 2.2], and thus CHR grammars with context-free rules only are HR grammars. Note, however, that derivations in [13] are always rightmost derivations that require \(\delta ' \in \mathcal {G}_{\mathcal {T}}\) in every derivation step. Example 1 demonstrates why derivations for contextual grammars cannot be restricted to just rightmost ones.

Our running example is an artificial CHR grammar chosen for the purpose of illustration only. More practical grammars are considered in Sect. 6.

Example 1

The CHR grammar \(\varGamma \) has \(\mathcal {N}= \{\mathsf {Z}, \mathsf {A}, \mathsf {B}, \mathsf {C}, \mathsf {D}\}\) and \(\mathcal {T}=\{\mathsf {a}, \mathsf {b}, \mathsf {c},\mathsf {d}\}\). \(\mathsf {Z}\) has arity 0, whereas \(\mathsf {A}\), \(\mathsf {B}\), and \(\mathsf {C}\) have arity 1; all other labels are binary. \(\varGamma \) has the following rules:

$$\begin{aligned} \mathsf {Z}&{\mathop {\longrightarrow }\limits ^{z}} \mathsf {A}^{x} \mathsf {B}^{x}&\mathsf {B}^{x}&{\mathop {\longrightarrow }\limits ^{b_1|b_2}} \mathsf {C}^{x} \mathsf {b}^{x y} \mid \mathsf {D}^{x y} \mathsf {b}^{x y}&\mathsf {D}^{x y}&{\mathop {\longrightarrow }\limits ^{d}}\mathsf {d}^{x z} \mathsf {d}^{z y} \\ \mathsf {A}^{x}&{\mathop {\longrightarrow }\limits ^{a}} \mathsf {a}^{x y}&\mathsf {C}^{x} {+} y&{\mathop {\longrightarrow }\limits ^{c}} \mathsf {D}^{x y} \mathsf {c}^{x y} \end{aligned}$$

In rule c, \(\mathsf {C}^{x} {+} y\) is a shorthand for the graph \(\langle \{x, y\}, \mathsf {C}^{x}\rangle \) with context node y; the other rules are context-free.

Fig. 1. Graphs \(g_1, g_2, g_3\) and their derivations in \(\varGamma \).

Figure 1 illustrates the three graphs \(g_1, g_2, g_3\) that constitute the language generated by \(\varGamma \), up to renaming (footnote 4). As usual, nodes are drawn as circles, whereas (binary) edges are drawn as arrows annotated with their label.

Moreover, Fig. 1 shows the derivations of \(g_1,g_2,g_3\). Underlined nonterminal literals are those rewritten in the next derivation step. The rightmost derivation (2) uses just context-free rules to derive \(g_1\). Both derivations (3) and (4) use the contextual rule c in their fourth step. There is only a subtle difference: (3) uses node 2 as context node, whereas (4) uses node 3. Neither derivation is rightmost. While (4) could be turned into a rightmost one, there is no rightmost derivation for \(g_2\): in (3), \(\mathsf {A}^{1}\) must be rewritten before \(\mathsf {C}^{1}\) because the rule rewriting \(\mathsf {C}^{1}\) uses 2 as a context node, which has to be created by rewriting \(\mathsf {A}^{1}\).

   \(\square \)

3 Making Shift-Reduce Parsing Predictive

In this section, we recall how a nondeterministic (and inefficient) shift-reduce parser is made predictive (and efficient) by using a characteristic finite-state automaton (CFA) for control, and by inspecting lookahead. The algorithm developed for HR grammars [13] carries over to CHR grammars in many cases; an exception that requires some modification is discussed at the end of this section.

Nondeterministic Shift-Reduce Parsers read an input graph, and keep a stack of literals (nonterminals and terminals) that have been processed so far. They perform two kinds of actions. Shift reads an unread literal of the input graph and pushes it onto the stack. Reduce can be applied if literals on top of the stack form the right-hand side of a rule (after a suitable match of their nodes). Then the parser pops these literals off the stack and pushes the left-hand side of the rule onto it (using the same match). The parser starts with an empty stack and an input graph, and accepts this input if the stack just contains the start graph \(\mathsf {Z}\) and the input has been read completely.
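The following back-of-the-envelope sketch (ours, for illustration; not the paper's parser) implements this nondeterministic procedure by exhaustive backtracking. Literals are (symbol, attachment) tuples, and rules pair a left-hand-side literal over formal nodes with a right-hand-side literal sequence; the validity check for context nodes (condition (1)) and the causality check are omitted:

```python
# Exhaustive, backtracking shift-reduce parsing (illustrative sketch).
def match_rhs(stack_tail, rhs):
    """Bind the formal nodes of rhs against the literals on top of the
    stack; return the binding, or None if they do not match.
    (Injectivity of the binding is not enforced in this sketch.)"""
    binding = {}
    for (sym, att), (fsym, fatt) in zip(stack_tail, rhs):
        if sym != fsym or len(att) != len(fatt):
            return None
        for node, formal in zip(att, fatt):
            if binding.setdefault(formal, node) != node:
                return None
    return binding

def parse(stack, unread, rules, start):
    """Search for an accepting parse; exponential in general. Assumes the
    grammar has no cyclic chain rules, so the search terminates."""
    if not unread and stack == [start]:
        return True
    for (lsym, latt), rhs in rules:            # try every reduce action
        k = len(rhs)
        if k <= len(stack):
            b = match_rhs(stack[len(stack) - k:], rhs)
            if b is not None and all(f in b for f in latt):
                lhs = (lsym, tuple(b[f] for f in latt))
                if parse(stack[:len(stack) - k] + [lhs], unread, rules, start):
                    return True
    for lit in set(unread):                    # try every shift action
        rest = list(unread)
        rest.remove(lit)
        if parse(stack + [lit], rest, rules, start):
            return True
    return False

# The rules of Example 1; in rule c, formal node "y" is the context node.
rules = [
    (("Z", ()),         [("A", ("x",)), ("B", ("x",))]),
    (("A", ("x",)),     [("a", ("x", "y"))]),
    (("B", ("x",)),     [("C", ("x",)), ("b", ("x", "y"))]),
    (("B", ("x",)),     [("D", ("x", "y")), ("b", ("x", "y"))]),
    (("C", ("x",)),     [("D", ("x", "y")), ("c", ("x", "y"))]),
    (("D", ("x", "y")), [("d", ("x", "z")), ("d", ("z", "y"))]),
]
g3 = [("a", (1, 2)), ("d", (1, 4)), ("d", (4, 3)), ("c", (1, 3)), ("b", (1, 3))]
assert parse([], g3, rules, ("Z", ()))
```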

Table 1. Shift-reduce parse of graph \(g_3\).

Table 1 shows a parse of graph \(g_3\) in Example 1. Each row shows the current stack, which grows to the right, the sequence of literals read so far, and the multiset of yet unread literals. Note that the literals of the input graph can be shifted in any order; the parser has to choose the literals so that it can construct a reverse rightmost derivation of the input graph. The last column in the table indicates the parser action that yields the situation shown in the next row. Underlined literals on the stack are those popped off the stack by the next reduction step.

We have shown in [13, Sect. 4] that such a parser can find an accepting parse if and only if the input graph is valid. But this parser is highly nondeterministic. For instance, it could start with shifting any literal in step 0, but only shifting \(\mathsf {a}^{1 2}\) leads to a successful parse. So it must employ expensive backtracking whenever it runs into a dead end.

Characteristic Finite-State Automata (CFAs) are used to reduce the nondeterminism in shift-reduce parsers. This concept has been transferred from LR parsing for context-free string grammars in [13, Sect. 5–7]. The CFA records items of the grammar which the parser is processing, where an item is a rule with a dot in its right-hand side that indicates how far processing has advanced. Figure 2 shows the characteristic items for the steps of the parse in Table 1; the numbering of the item sets corresponds to the steps in Table 1. Each of these sets corresponds to a state of the CFA, with a particular renaming of nodes (footnote 5).

Fig. 2. Characteristic items and states for the steps of the parse of graph \(g_3\) in Table 1.

In step 0 of the parse, the parser starts with the item \(\mathsf {Z}{\mathop {\rightarrow }\limits ^{z}} {\,\centerdot \,}\mathsf {A}^{x} \mathsf {B}^{x}\), where the dot at the beginning indicates that nothing has been processed so far. As the dot is before the nonterminal literal \(\mathsf {A}^{x}\), which can be replaced using rule a, the corresponding configuration in Fig. 2 also contains the item \(\mathsf {A}^{x} {\mathop {\rightarrow }\limits ^{a}} {\,\centerdot \,}\mathsf {a}^{x y}\). So the parser can read a literal \(\mathsf {a}^{x y}\) (with a suitable match of x and y). It cannot process the nonterminal \(\mathsf {A}^{x}\), as only terminals can be shifted; so shifting \(\mathsf {a}^{1 2}\) is the only action fitting the grammar in step 0. As a consequence, step 1 is characterized by the sole item \(\mathsf {A}^{x} {\mathop {\rightarrow }\limits ^{a}} \mathsf {a}^{x y} {\,\centerdot \,}\frac{x}{1} \frac{y}{2}\), which indicates that the right-hand side of rule a has been processed completely and x and y have been matched to nodes 1 and 2, respectively. This implies that \(\mathsf {a}^{1 2}\) can be reduced to \(\mathsf {A}^{1}\), which turns item \(\mathsf {Z}{\mathop {\rightarrow }\limits ^{z}} {\,\centerdot \,}\mathsf {A}^{x} \mathsf {B}^{x}\) of step 0 into item \(\mathsf {Z}{\mathop {\rightarrow }\limits ^{z}} \mathsf {A}^{x} {\,\centerdot \,}\mathsf {B}^{x} \frac{x}{1}\) in step 2. Step 2 contains further items because the dot is in front of nonterminal \(\mathsf {B}^{x}\).

Transitions in the CFA move the dot across a literal in some of its items, and match nodes accordingly. The transitions in Fig. 2 are labeled with these literals. Note that the sequence of literals along any path starting in state 0 equals the stack of the step that is reached by the path.

Every shift-reduce parser can be controlled by the CFA so that it only chooses actions in accordance with the grammar, i.e., actions that can still lead to a successful parse if the input graph is valid [13, Sect. 9]. Still, some sources of nondeterminism remain, which have to be resolved by different means.

Lookahead may be used when a set of items allows several promising actions. For instance, consider step 5 in Fig. 2, where the dot is in front of \(\mathsf {b}^{1 3}\) and \(\mathsf {c}^{1 3}\) (under the match \(\frac{x}{1} \frac{y}{3}\)), which both occur in the unread input. Only shifting \(\mathsf {c}^{1 3}\) leads to a successful parse of \(g_3\). If the parser shifted \(\mathsf {b}^{1 3}\) instead, the next steps would reduce according to rules \(b_2\) and z, yielding the stack \(\mathsf {Z}\) but leaving \(\mathsf {c}^{1 3}\) unread, so that the parse would fail. In such a situation, one must check which literals may follow or cannot follow (immediately or later) when either of the actions is selected. An analysis of grammar \(\varGamma \) (prior to parsing) reveals that selecting \(\mathsf {c}^{1 3}\) still allows the parser to shift \(\mathsf {b}^{1 3}\) later, whereas selecting \(\mathsf {b}^{1 3}\) never allows it to shift \(\mathsf {c}^{1 3}\) later. So the predictive shift-reduce (PSR) parser must shift \(\mathsf {c}^{1 3}\) in step 5. In general (though not for \(\varGamma \)), an HR grammar may have states with conflicts, where the lookahead does not determine which shift or reduction should be chosen. Such a grammar is not PSR-parsable [13, Sect. 9]. We will discuss the analysis of lookahead in Sect. 5.

Table 2. Wrong shift-reduce parse of graph \(g_4\). Steps 0–4 are essentially the same as in Table 1.

Context Nodes require a modification of the PSR parsing algorithm. For an HR grammar, a PSR parser can always continue its parse to a successful one (for some remaining input) as long as all actions comply with the CFA. This does not necessarily hold for CHR grammars.

For instance, consider the invalid graph \(g_4 = \mathsf {a}^{1 2} \mathsf {d}^{1 4} \mathsf {d}^{4 2} \mathsf {b}^{1 2}\). Its parse, shown in Table 2, starts with the same actions 0–4 as for \(g_3\) in Table 1 and Fig. 2. However, only \(\mathsf {b}^{1 2}\) is unread in step 5, and it is then shifted. Therefore, the parser will eventually accept \(g_4\) as all literals have been read, although \(g_4\) is invalid! In fact, the reduction in step 6 is wrong because condition (1) in Definition 2 is violated: the reduce action is the reverse of the derivation step \(\mathsf {A}^{1} \mathsf {B}^{1} \Rightarrow _{b_2} \mathsf {A}^{1} \mathsf {D}^{1 2} \mathsf {b}^{1 2}\), which creates node 2, although node 2 is also created when deriving \(\mathsf {A}^{1}\). But the error already happened in step 5 with its characteristic items \(\mathsf {B}^{x} {\mathop {\rightarrow }\limits ^{b_2}} \mathsf {D}^{x y} {\,\centerdot \,}\mathsf {b}^{x y} \frac{x}{1} \frac{y}{2}\) and \(\mathsf {C}^{x} {+} y {\mathop {\rightarrow }\limits ^{c}} \mathsf {D}^{x y} {\,\centerdot \,}\mathsf {c}^{x y} \frac{x}{1} \frac{y}{2}\). Since node 2 has been reused for y, it must be a context node, i.e., the first of these items, which is based on the context-free rule \(b_2\), is not valid in step 5. Apparently, the CFA does not treat context nodes correctly. We shall see in the following section that a CHR grammar like \(\varGamma \) has to be transformed before parsing in order to solve this problem.

4 IOC Normalform

The problem described in the previous section could have been avoided if the parser “knew” in step 5 that node 2 of the nonterminal literal \(\mathsf {D}^{1 2}\) is a context node. This would be easy if the same held for all occurrences of \(\mathsf {D}\)-literals in \(\varGamma \), but it is not true for the occurrence of \(\mathsf {D}^{x y}\) in \(b_2\). In the following, we consider only CHR grammars where such situations cannot occur: we require that the label of a nonterminal literal always determines the roles of its attached nodes. We then say that a CHR grammar is in IOC normalform. Fortunately, every CHR grammar can be transformed into an equivalent grammar in IOC normalform. Before we define the IOC normalform formally, we discuss roles of nodes and profiles of nonterminals.

Nonterminal literals are produced by reduce actions, i.e., when the dot is moved across the nonterminal literal within a rule. For instance, consider literal \(\mathsf {D}^{x y}\) in item \(\mathsf {B}^{x} \rightarrow {\,\centerdot \,}\mathsf {D}^{x y} \mathsf {b}^{x y} \frac{x}{1}\) in Fig. 2. Node x is already bound to node 1 before the dot is moved across \(\mathsf {D}^{x y}\), whereas y is unbound before, but bound afterwards (to node 3 in step 5). Nodes x and y act like in and out parameters, respectively, of a procedure; we say that x has role I (for “in”) and y has role O (for “out”). Combining the roles of x and y, we say that \(\mathsf {D}^{x y}\) has profile IO in this particular situation. However, the situation is different for \(\mathsf {D}^{x y}\) in item \(\mathsf {C}^{x} {+} y \rightarrow {\,\centerdot \,}\mathsf {D}^{x y} \mathsf {c}^{x y} \frac{x}{1}\). Node x again has role I, and y is again not bound before moving the dot across \(\mathsf {D}^{x y}\). But y must then be bound to the context node of this rule; we say that y has role C (for “context”), and \(\mathsf {D}^{x y}\) has profile IC in this situation. So the profile of \(\mathsf {D}^{x y}\) is not determined by its label \(\mathsf {D}\). This must not happen for a CHR grammar in IOC normalform.

Definition 3

A CHR grammar \(\varGamma = (\varSigma ,\mathcal {T},\mathcal {R},\mathsf {Z})\) is in IOC normalform if there is a function \(P:\mathcal {N}\rightarrow \{ I,O,C \}^*\) so that, for every rule \((\alpha \rightarrow \beta ) \in \mathcal {R}\) and every nonterminal literal \(\varvec{B} = B^{y_1 \ldots y_m} \in Lit _{\mathcal {N}}\) occurring in \(\beta \), i.e., \(\beta = \delta \varvec{B} \delta '\) for some \(\delta ,\delta ' \in Lit _{\varSigma }^*\), \(P(B) = p_1 \cdots p_m\) with

$$ p_i = \begin{cases} I & \text{if } y_i \in X(\delta ) \\ O & \text{if } y_i \notin X(\delta ) \wedge y_i \notin X(\alpha ) \\ C & \text{if } y_i \notin X(\delta ) \wedge y_i \in X(\alpha ) \setminus X(\varvec{A}) \\ p'_j & \text{if } y_i \notin X(\delta ) \wedge y_i = x_j \end{cases} \qquad (\text{for } i = 1, \ldots , m) $$

where \( lit (\alpha )= \varvec{A} = A^{x_1 \ldots x_k}\) and \(P(A) = p'_1 \cdots p'_k\).
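Operationally, Definition 3 assigns a role to each attached node of a nonterminal literal occurrence. The following sketch (our illustration of the definition, not Grappa code) computes the profile of one occurrence and checks a profile assignment P for consistency; rules are triples of a left-hand-side literal, its set of context nodes, and a right-hand-side literal sequence:

```python
# Computing roles and profiles per Definition 3 (illustrative sketch).
def occurrence_profile(lhs, context_nodes, rhs, pos, P):
    """Profile of the nonterminal literal rhs[pos] = (B, (y1, ..., ym)).
    lhs = (A, (x1, ..., xk)); context_nodes = X(alpha) minus X(A); P maps
    nonterminal labels to their (assumed) profiles."""
    A, xs = lhs
    _, ys = rhs[pos]
    seen = {n for _, att in rhs[:pos] for n in att}   # the nodes X(delta)
    roles = []
    for y in ys:
        if y in seen:
            roles.append("I")                 # bound further left in beta
        elif y in context_nodes:
            roles.append("C")                 # must match a context node
        elif y in xs:
            roles.append(P[A][xs.index(y)])   # inherits the lhs role p'_j
        else:
            roles.append("O")                 # freshly created node
    return "".join(roles)

def in_ioc_normalform(rules, nonterminals, P):
    """Rules are triples (lhs, context_nodes, rhs)."""
    return all(
        occurrence_profile(lhs, ctx, rhs, pos, P) == P[sym]
        for lhs, ctx, rhs in rules
        for pos, (sym, _) in enumerate(rhs)
        if sym in nonterminals
    )

# The grammar of Example 2 passes the check with the profiles given there:
P = {"Z": "", "A": "O", "B": "I", "C": "I", "D1": "IO", "D2": "IC"}
rules_ioc = [
    (("Z", ()),          set(),  [("A", ("x",)), ("B", ("x",))]),
    (("A", ("x",)),      set(),  [("a", ("x", "y"))]),
    (("B", ("x",)),      set(),  [("C", ("x",)), ("b", ("x", "y"))]),
    (("B", ("x",)),      set(),  [("D1", ("x", "y")), ("b", ("x", "y"))]),
    (("C", ("x",)),      {"y"},  [("D2", ("x", "y")), ("c", ("x", "y"))]),
    (("D1", ("x", "y")), set(),  [("d", ("x", "z")), ("d", ("z", "y"))]),
    (("D2", ("x", "y")), set(),  [("d", ("x", "z")), ("d", ("z", "y"))]),
]
assert in_ioc_normalform(rules_ioc, set(P), P)
```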

Example 2

CHR grammar \(\varGamma \) of Example 1 can be turned into a CHR grammar \(\varGamma '\) in IOC normalform by splitting up the nonterminal label \(\mathsf {D}\) into \(\mathsf {D}_1\) and \(\mathsf {D}_2\) and using the following rules:

$$\begin{aligned} \mathsf {Z}&{\mathop {\longrightarrow }\limits ^{z}} \mathsf {A}^{x} \mathsf {B}^{x}&\mathsf {B}^{x}&{\mathop {\longrightarrow }\limits ^{b_1|b_2}} \mathsf {C}^{x} \mathsf {b}^{x y} \mid \mathsf {D}_1^{x y} \mathsf {b}^{x y}&\mathsf {D}_1^{x y}&{\mathop {\longrightarrow }\limits ^{d_1}}\mathsf {d}^{x z} \mathsf {d}^{z y} \\ \mathsf {A}^{x}&{\mathop {\longrightarrow }\limits ^{a}} \mathsf {a}^{x y}&\mathsf {C}^{x} {+} y&{\mathop {\longrightarrow }\limits ^{c}} \mathsf {D}_2^{x y} \mathsf {c}^{x y}&\mathsf {D}_2^{x y}&{\mathop {\longrightarrow }\limits ^{d_2}}\mathsf {d}^{x z} \mathsf {d}^{z y} \end{aligned}$$

It can easily be verified that the function \(P\) defined by \(\mathsf {Z} \mapsto \varepsilon \), \(\mathsf {A} \mapsto O\), \(\mathsf {B} \mapsto I\), \(\mathsf {C} \mapsto I\), \(\mathsf {D}_1 \mapsto IO\), and \(\mathsf {D}_2 \mapsto IC\) satisfies the conditions of Definition 3. In particular, every \(\mathsf {D}_1\)-literal has profile IO, and every \(\mathsf {D}_2\)-literal has profile IC.

The general construction is straightforward: the simplest method is to create, for every nonterminal label B, copies \(B_p\) in which B is indexed with each of its \(3^{ arity (B)}\) possible profiles p. Each rule for B is thus turned into \(3^{ arity (B)}\) rules, and the nonterminal literals in the right-hand sides are annotated according to Definition 3. Grappa turns this procedure around to avoid the exponential blow-up in most practically relevant cases, as follows. In all rules for a nonterminal label B, assume first that the profile of the left-hand side is \(I^{ arity (B)}\), and annotate it accordingly. Then annotate the nonterminal labels in all right-hand sides, again following Definition 3. This may give rise to a number of yet unseen annotated nonterminal labels. Create rules for them by copying the respective original rules as before, and repeat until no more new annotations are encountered.
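A possible worklist realization of this demand-driven construction, building on occurrence_profile from the previous sketch, could look as follows; for simplicity it seeds the worklist with the start symbol (whose profile is \(\varepsilon \)) rather than with all-I profiles, and Grappa's actual implementation may differ in such details:

```python
# A demand-driven annotation (illustrative sketch; reuses occurrence_profile).
def annotate(rules_of, nonterminals, start):
    """rules_of[B] lists the triples (lhs, context_nodes, rhs) of B's
    rules, with lhs = (B, xs). Returns rules for the annotated labels
    (B, profile) that are actually reachable from the start symbol."""
    annotated = {}
    worklist = [(start, "")]              # the start symbol has profile ε
    while worklist:
        key = worklist.pop()
        if key in annotated:
            continue
        B, p = key
        annotated[key] = []
        for lhs, ctx, rhs in rules_of[B]:
            new_rhs = []
            for pos, (sym, att) in enumerate(rhs):
                if sym in nonterminals:   # annotate per Definition 3 ...
                    q = occurrence_profile(lhs, ctx, rhs, pos, {B: p})
                    new_rhs.append(((sym, q), att))
                    worklist.append((sym, q))   # ... and process on demand
                else:
                    new_rhs.append((sym, att))
            annotated[key].append(((key, lhs[1]), ctx, new_rhs))
    return annotated
```

Applied to the rules of \(\varGamma \), this yields exactly two annotated copies of \(\mathsf {D}\), with profiles IO and IC, i.e., the labels \(\mathsf {D}_1\) and \(\mathsf {D}_2\) of Example 2.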

Fig. 3. The characteristic finite automaton of \(\varGamma '\) in Example 2.

Figure 3 shows the CFA of grammar \(\varGamma '\). The start state is \(q_0\), indicated by the incoming arc out of nowhere. The CFA has been built in essentially the same way as the characteristic items and states for the steps when parsing the specific graph \(g_3\) in Fig. 2. Instead of the concrete nodes of \(g_3\), we now use parameters \(\varvec{a}\), \(\varvec{b}\), and \(\varvec{c}\). They are placeholders, which will be bound to nodes of the particular input graph during parsing. For instance, item \(\mathsf {A}^{x} {\mathop {\rightarrow }\limits ^{a}} \mathsf {a}^{x y} {\,\centerdot \,}\frac{x}{1} \frac{y}{2}\) that characterizes step 1 in Fig. 2 corresponds to state \(q_2^{\varvec{a}\varvec{b}}\) in Fig. 3 where \(\varvec{a}\) and \(\varvec{b}\) have been bound to nodes 1 and 2, respectively.

The transitions between states also refer to parameters. For instance, the transition from \(q_1^{\varvec{a}}\) to \(q_3^{\varvec{a}\varvec{b}}\) means that the dots in the two items \(\mathsf {D}_1^{x y} {\mathop {\rightarrow }\limits ^{d_1}} {\,\centerdot \,}\mathsf {d}^{x z} \mathsf {d}^{z y} \frac{x}{\varvec{a}}\) and \(\mathsf {D}_2^{x y} {\mathop {\rightarrow }\limits ^{d_2}} {\,\centerdot \,}\mathsf {d}^{x z} \mathsf {d}^{z y} \frac{x}{\varvec{a}}\) are moved across \(\mathsf {d}^{x z}\) where x is bound to \(\varvec{a}\) and z is yet unbound. The corresponding shift action must select and read a \(\mathsf {d}\)-edge in the input graph that leaves the node being bound to parameter \(\varvec{a}\). The node of the input graph that matches z and is bound to parameter \(\varvec{b}\) in state \(q_3^{\varvec{a}\varvec{b}}\), also becomes “known” that way. It must not have been read before because \(d_1\) and \(d_2\) are context-free rules. Figure 3 represents the fact that \(\varvec{b}\) is bound to the target node of the \(\mathsf {d}\)-edge by using \(\varvec{b}\) as the underlined target node, and the fact that this node has not been read before is indicated by the label “\(\varvec{b}\) is new”. Using the IOC normalform, this distinction between “new” and “old” nodes makes it possible to handle context nodes correctly (see also the discussion at the end of Sect. 3). It marks the major technical difference between the context-free parser and the contextual one.

The label “\(\varvec{c}\) is old” at the transition from \(q_3^{\varvec{a}\varvec{b}}\) to \(q_9^{\varvec{a}\varvec{b}\varvec{c}}\), however, indicates that \(\varvec{c}\) is bound to a node of the input graph that has already been read, together with a shifted edge, earlier in the parsing process. This situation can occur although this transition moves the dot of item \(\mathsf {D}_2^{x y} {\mathop {\rightarrow }\limits ^{d_2}} \mathsf {d}^{x z} {\,\centerdot \,}\mathsf {d}^{z y} \frac{x}{\varvec{a}} \frac{z}{\varvec{b}}\) across \(\mathsf {d}^{z y}\) in the context-free rule \(d_2\): node y of the nonterminal literal corresponds to context node y of the contextual rule c. This is reflected by the profile IC of \(\mathsf {D}_2\), which says that the second node of \(\mathsf {D}_2\) must be a context node. Further note that \(q_8^{\varvec{a}\varvec{b}\varvec{c}}\) also contains the same item as \(q_9^{\varvec{a}\varvec{b}\varvec{c}}\), but \(\varvec{c}\) is declared “new” by the corresponding transition. This is so because the context node may still be unread in this situation, and the parser cannot yet distinguish whether it is currently processing rule \(d_2\) or \(d_1\), where y is not a context node, as indicated by profile IO of \(\mathsf {D}_1\). This is demonstrated in the following.
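In a generated parser, this “new”/“old” distinction amounts to a membership test against the set of input nodes that have already been read. A sketch for binary edges (names are ours, for illustration):

```python
# The "new"/"old" test during a shift, for binary edges.
def candidate_shifts(unread, label, source, read_nodes, want_new):
    """Unread literals label^{source t} whose target t is "new" (not in
    read_nodes) or "old" (already read), as the transition requires."""
    for sym, (s, t) in unread:
        if sym == label and s == source and ((t not in read_nodes) == want_new):
            yield (sym, (s, t))
```

For instance, the transition from \(q_3^{\varvec{a}\varvec{b}}\) to \(q_8^{\varvec{a}\varvec{b}\varvec{c}}\) asks for want_new=True, while the one to \(q_9^{\varvec{a}\varvec{b}\varvec{c}}\) asks for want_new=False.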

Table 3. PSR parse of \(g_3\) using the CFA in Fig. 3.

Table 3 shows the only parse that a PSR parser using the CFA in Fig. 3 will try when analyzing graph \(g_3\). Note that this parse corresponds to the unique successful (non-predictive) shift-reduce parse among all possible attempts to parse graph \(g_3\), shown earlier in Table 1. The parser predicts the unique promising action that keeps it on track towards a successful parse. This is done by keeping, on its stack, an alternating sequence of CFA states and literals processed so far, for instance \(q_0 \, \mathsf {A}^{1} \, q_{1}^{1} \, \mathsf {d}^{1 4} \, q_{3}^{14} \, \mathsf {d}^{4 3} \, q_{8}^{143}\) in step 4. The stack contents represent a walk through the CFA from the initial state \(q_0\) to \(q_{8}^{143}\) via \(q_{1}^{1}\) and \(q_{3}^{14}\); the literals between consecutive states correspond to the transitions and their labels. When we ignore the states, the stack equals the stack of the (nondeterministic) shift-reduce parse shown in Table 1. The shift and reduce actions of a PSR parser work as follows:

A shift action corresponds to an outgoing transition of the state which is currently on top of the stack. For instance, in step 3 with topmost state \(q_3^{14}\), there are two transitions leaving \(q_3^{\varvec{a}\varvec{b}}\). They both look for a \(\mathsf {d}\)-edge leaving node 4 in \(g_3\). The only such edge is \(\mathsf {d}^{4 3}\), and the parser must choose the transition to \(q_8^{143}\) because node 3 is “new”, i.e., has not yet occurred in the parse.

A reduce action may be selected if the topmost state on the stack contains an item with the dot at the end. For instance, consider step 4 with topmost state \(q_8^{143}\). This state in fact contains two items with a dot at their ends: The parser may either reduce according to rule \(d_1\) or \(d_2\); the CFA cannot help the parser with this decision. However, further analysis (see the following section) reveals that only reducing \(d_2\) can lead to a successful parse. The parser, therefore, removes the topmost four elements from the stack (the right-hand side of rule \(d_2\) together with the states in between, indicated by the underline), leaving \(q_1^1\) as the intermediate topmost state. It then selects the transition for the obtained nonterminal literal \(\mathsf {D}_2^{1 3}\) that leaves \(q_1^1\), i.e., the transition to \(q_5^{13}\).

The PSR parser accepts \(g_3\) in step 9 because all literals of \(g_3\) have been read and the topmost state is \(q_7^1\) which has the dot at the end of rule z, i.e., a last reduction would produce \(\mathsf {Z}\).
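Schematically, the driver loop of a PSR parser maintains exactly this alternating stack of states and literals. In the sketch below (ours), the callbacks select_reduce, select_shift, goto, and accepting stand for the tables that the distiller generates from the CFA; for a PSR grammar they return at most one action per situation, which is what makes the parser predictive:

```python
# A schematic PSR driver loop; the callbacks encode the generated tables.
from collections import Counter

def psr_parse(graph_literals, q0, select_reduce, select_shift, goto, accepting):
    stack = [q0]                           # alternating states and literals
    unread = Counter(graph_literals)       # the remaining input, a multiset
    while True:
        state = stack[-1]
        if accepting(state) and not unread:
            return True                    # e.g. state q7 in step 9, Table 3
        red = select_reduce(state, unread) # at most one action (predictive)
        if red is not None:
            k, lhs = red
            del stack[-2 * k:]             # pop k literals and k states
            stack += [lhs, goto(stack[-1], lhs)]
            continue
        sh = select_shift(state, unread)
        if sh is None:
            return False                   # stuck: the input graph is invalid
        lit, nxt = sh
        unread -= Counter([lit])           # consume the shifted literal
        stack += [lit, nxt]
```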

Table 4. PSR parse of the invalid graph \(g_4\) using the CFA in Fig. 3.

Finally, Table 4 shows that the PSR parser using CHR grammar \(\varGamma '\) in IOC normalform correctly recognizes that graph \(g_4\) is invalid. It fails in step 5 in state \(q_5^{12}\), where it looks for an unread literal \(\mathsf {c}^{1 2}\) but only finds \(\mathsf {b}^{1 2}\), which cannot be shifted, in contrast to the situation shown in Table 2.

5 Lookahead Analysis

The previous section revealed that the CFA does not always provide enough information for the parser to unambiguously select the next action. This is in fact unavoidable (at least if \(\mathrm P\ne \mathrm {NP}\)) because PSR parsing is very efficient while HR graph languages in general can be NP-complete. The problem is that the CFA may contain states with items that trigger different actions, for instance state \(q_8^{\varvec{a}\varvec{b}\varvec{c}}\) in Fig. 3, which allows two different reduce actions. Then we must analyze (prior to parsing) which literals may follow or cannot follow (immediately or later) in a correct parse when either of the possible actions is chosen. This may yield two results: The analysis may either identify a fixed number of (lookahead) literals that the parser must find among the unread literals in order to always predict the correct next action, or there is at least one state for which this is not possible. The latter situation is called a conflict. A CHR grammar is PSR only if no state has a conflict. Here, we describe this situation and the peculiarities for CHR grammars by means of grammar \(\varGamma '\) and state \(q_8^{\varvec{a}\varvec{b}\varvec{c}}\).

Consider the situation when the PSR parser has reached \(q_8^{\varvec{a}\varvec{b}\varvec{c}}\). We wish to know which literals will be read next or later if either “reduce \(d_1\)” or “reduce \(d_2\)” is chosen, producing \(\mathsf {D}_1^{\varvec{a} \varvec{c}}\) or \(\mathsf {D}_2^{\varvec{a} \varvec{c}}\). Figure 4 shows the history of the parser that is relevant in this situation. The parser is either in item \(I_6\) or \(I_7\). If it is in \(I_7\), it must also be in \(I_5\) where its literal \(\mathsf {D}_2^{x y}\) corresponds to the left-hand side of \(I_7\), and so on. \(\mathsf {B}^{x}\) in \(I_1\) corresponds to the left-hand side of either \(I_3\) or \(I_4\). Note that the node renamings in \(I_2,\ldots ,I_5\) reflect the information available when \(I_6\) or \(I_7\) are reduced. For instance, node y of \(I_5\) will be bound to parameter \(\varvec{c}\). However, the choice of context node y of \(I_5\) affects these situations. In this small example, y can be bound either to the node which y of \(I_2\) is bound to, or to the node which y of \(I_3\) is bound to. In the former case, y of \(I_2\) is bound to \(\varvec{c}\) since we already know that y of \(I_5\) is bound to \(\varvec{c}\); then, y of \(I_3\) must be a yet unread node, indicated by \(\frac{y}{\text {--}}\). Otherwise, y of \(I_3\) must be bound to \(\varvec{c}\), and y of \(I_2\) has already been read, but is not stored in any of the parameters, indicated by \(\frac{y}{\bullet }\).

Fig. 4. Lookahead analysis for state \(q_8^{\varvec{abc}}\) in Fig. 3.

It is clear from Fig. 4 that the parser must read a literal \(\mathsf {c}^{\varvec{a} \varvec{c}}\) next if it reduces \(I_7\), or more precisely, a \(\mathsf {c}\)-literal that is attached to the input nodes bound to parameters \(\varvec{a}\) and \(\varvec{c}\), respectively. This is so because \(\mathsf {c}^{x y}\) immediately follows \(\mathsf {D}_2^{x y}\) in \(I_5\) (indicated by a box). If the parser reduces \(I_6\) instead, it must read \(\mathsf {b}^{\varvec{a} \varvec{c}}\) next. In fact, the parser must check whether there is a yet unread literal \(\mathsf {c}^{\varvec{a} \varvec{c}}\): if it finds one, it must reduce \(d_2\), otherwise it must reduce \(d_1\). To see this, assume there is an unread \(\mathsf {c}^{\varvec{a} \varvec{c}}\). This literal can be read when \(I_7\) is reduced, but never if \(I_6\) is reduced, because no further literal would be read after \(\mathsf {b}^{\varvec{a} \varvec{c}}\). And if there is no \(\mathsf {c}^{\varvec{a} \varvec{c}}\), the parser would get stuck after reducing \(I_7\).
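For state \(q_8^{\varvec{a}\varvec{b}\varvec{c}}\), this decision is thus a single membership test on the multiset of unread literals (a sketch in the representation used earlier; names are ours):

```python
# The lookahead decision in q8, with unread represented as a multiset:
def decide_q8(unread, a, c):
    """Reduce d2 iff an unread literal c^{a c} exists; otherwise reduce d1."""
    return "d2" if ("c", (a, c)) in unread else "d1"
```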

On the other hand, the parser cannot make a reliable decision based on the existence of just a literal \(\mathsf {b}^{\varvec{a} \varvec{c}}\), because such a literal can be read by the parser if it chooses “reduce \(d_1\)”, but also if it chooses “reduce \(d_2\)”. The former is obvious. To see the latter, consider the literal read immediately after \(\mathsf {c}^{\varvec{a} \varvec{c}}\) when \(I_7\) has been reduced. This must be a literal that corresponds to \(\mathsf {b}^{x y}\) in \(I_3\), i.e., it is either \(\mathsf {b}^{\varvec{c} \text {--}}\) or \(\mathsf {b}^{\varvec{a} \varvec{c}}\) according to the possible renamings of \(I_3\). This means that \(\mathsf {b}^{\varvec{a} \varvec{c}}\) may in fact be read later if \(I_7\) is reduced. Note that this is only possible because the node used as y of \(I_3\) can be the context node used as y of \(I_5\).

6 Realization and Evaluation

PSR parsers for CHR grammars can be generated with the Grappa parser distiller (see footnote 1). Grappa checks whether the CHR grammar has the free edge choice property, which is not discussed in this paper. It ensures that, if the parser can end up in a conflict-like situation between shifting alternative edges, the choice will not affect the outcome of the parsing; see [13, Sect. 9] for details.

Parsing of an input graph starts with the identification of unique start nodes, i.e., a place in the graph where parsing has to start. (This is also not considered in this paper; see [11, Sect. 4] for details.) Then the parser uses the concepts outlined in the previous sections to find a shift-reduce parse of the input graph, and finally tries to construct a derivation from this parse so that context nodes are never used before they have been created in the derivation. However, it may turn out in this last step that the input graph is invalid although a parse has been found. This does happen if there are cyclic dependencies between derivation steps that create nodes and those that use such nodes as context nodes [7]. Identification of start nodes and finding a PSR parse are as efficient as in the context-free case. As discussed in [11, 13], this means that these steps require linear time in the size of the graph, for all practical purposes. So does creating a derivation from the parse by topological sorting. As a consequence, PSR parsing with respect to CHR grammars runs in linear time in the size of the input graph. However, a more detailed discussion must be omitted here due to lack of space.
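The last step can be realized by a topological sort of the derivation steps along their causal dependencies, e.g., with Kahn's algorithm. The sketch below (ours) only tracks the dependencies “a context node must be created before it is used” and reports failure on a cycle; tree-order dependencies between derivation steps would be added as further edges in the same way:

```python
# Ordering derivation steps by topological sorting (illustrative sketch).
from collections import deque

def order_derivation(steps, creates, uses_context):
    """steps: step ids; creates[s]: nodes created by step s (every node is
    created by exactly one step); uses_context[s]: context nodes matched
    by step s. Returns an admissible order, or None if dependencies are
    cyclic, in which case the input is invalid despite the parse."""
    creator = {n: s for s in steps for n in creates[s]}
    succs = {s: set() for s in steps}
    indegree = {s: 0 for s in steps}
    for s in steps:
        for n in uses_context[s]:
            t = creator[n]               # the step creating n must precede s
            if s not in succs[t]:
                succs[t].add(s)
                indegree[s] += 1
    queue = deque(s for s in steps if indegree[s] == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for s in succs[t]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    return order if len(order) == len(steps) else None
```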

We now use the more realistic language of flowcharts to evaluate PSR parsing for CHR grammars. Note that these “unstructured” flowcharts cannot be specified with HR grammars as they have unbounded treewidth.

Example 3

(Flowcharts). Flowcharts represent the control flow of low-level imperative programs. In the graphs representing these charts, nodes drawn as small black circles represent program states, unary edges labeled with \(\triangledown \) and \(\blacktriangle \) designate the unique start state and the stop states, respectively, binary edges represent instructions and jumps, and ternary edges represent predicates, which select one of the following branches. (Here, we ignore the texts that usually occur in instructions and predicates.) Flowcharts can be generated by a CHR grammar [7, Ex. 2.6] with four rules, described next:

The context-free rules h, i, and c generate control flow trees of the halt, instruction, and conditional selection, respectively, and the fourth rule, j, which is not context-free, inserts jumps to some program location in the context. Figure 5 shows the diagrams of a derivation of a small flowchart.

Fig. 5. Derivation of a flowchart.

Fig. 6. Definition of flowchart graphs \(F_n\) (left) and running time (in milliseconds) of different parsers analyzing \(F_n\) for varying \(n\) (right).

Grappa has been used to generate a PSR parser for this grammar. In order to evaluate its efficiency, we have generated three further parsers: a predictive top-down (PTD) parser for the same grammar (after extending PTD parsing [10] to CHR grammars), a Cocke-Younger-Kasami-style (CYK) parser [20], and finally a generalized predictive shift-reduce (GPSR) parser for structured flowcharts [18]. GPSR parsing extends PSR parsing to grammars that are not PSR, which is the case for structured flowcharts.

All four parsers have been used to parse flowcharts \(F_n\) as defined in Fig. 6 (left), which consist of n predicates, \(3n+1\) instructions, and n jumps. \(F_n\) has a subgraph \(D_n\), which, for \(n>0\), contains subgraphs \(D_m\) and \(D_{m'}\) with \(n = m + m' + 1\). Note that the predicates in \(F_n\) form a binary tree with n nodes when we ignore instructions; we always choose m and \(m'\) such that this tree is a complete binary tree. Note furthermore that each \(F_n\) in fact forms a structured flowchart, which must nevertheless be built with jumps in our flowchart grammar. The GPSR parser has been applied to variations of \(F_n\) wherein jumps have been removed and their source and target nodes have been identified.

Figure 6 (right) shows the running time of each of the four parsers when analyzing graphs \(F_n\) with varying values of n. It was measured on an iMac 2017, 4.2 GHz Intel Core i7, Java 1.8.0_202 with standard configuration, and is shown in milliseconds on the y-axis, with n on the x-axis. The plot shows nicely that the PSR parser is linear, and about four times faster than the PTD parser. The GPSR parser is much slower because it deals with conflicts of the CFA by following several parses simultaneously. The CYK parser is the slowest because it builds up a table of nonterminal edges by dynamic programming.

Table 5. Key figures for parsers generated with Grappa. “Grammar” columns indicate maximal nonterminal arity (A), nonterminals (N) and terminals (T), context-free (\(R_{\text {cf}}\)) and contextual (\(R_{\text {c}}\)) rules, and the maximal length of right-hand sides (L). As the AMR grammar must be transformed into IOC normalform, its key figures are listed under “IOC normalform”. “CFA” columns count states (S), items (I), and transitions (\(\varDelta \)) in the CFAs. The last column counts the conflicts in the CFAs.

We have also created parsers for CHR grammars of two additional graph languages: program graphs [7, Ex. 2.7], which represent the syntactic structure of object-oriented programs and are used for refactoring, and Abstract Meaning Representations (AMRs), which are widely used in natural language processing to represent sentence meaning. To define the structure of AMRs, CHR grammars are preferable to HR grammars because of their greater capability to cover the set of all AMRs over a given domain; at the same time, the grammars become both smaller and simpler. Unfortunately, the example grammar from [14] is not PSR because its CFA has 11 conflicts (see Table 5), but one can employ the generalized PSR parsing introduced in [18]. Table 5 lists key figures of the three example languages outlined above.

7 Conclusions

In this paper, we have described how predictive shift-reduce parsing can be extended from HR grammars to CHR grammars. These parsers can be generated with Grappa (see footnote 1), and are as efficient as the context-free version, although they apply to a larger class of languages.

Related Work

Much work has been dedicated to graph parsing. Since space is limited, we mention only results and approaches for HR grammars. Early on, Lautemann [19] identified connectedness conditions which make polynomial parsing of certain HR languages possible, using a generalization of the Cocke-Younger-Kasami (CYK) algorithm for context-free string languages. However, the degree of the polynomial depends on the HR language. Stronger connectedness requirements yield cubic parsing (by a different kind of algorithm), as shown by Vogler and Drewes [6, 21]. A CYK parser for general HR grammars (even extended by so-called embedding rules) was implemented by Minas in DiaGen [20]. While this parser takes exponential time in general, it can handle graphs with hundreds of nodes and edges.

The line of work continued in this paper started with the proposal of predictive top-down (PTD) parsers in [10] and continued with the introduction of predictive shift-reduce (PSR) parsers [12, 13]. Both apply to suitable classes of HR grammars, while the current paper extends PSR parsers to CHR grammars.

Independently, Chiang et al. [4] improved the parser by Lautemann by making use of tree decompositions, and Gilroy et al. [15] studied parsing for the regular graph grammars by Courcelle [5]. Finally, Drewes et al. [2, 3] study a structural condition which enables very efficient uniform parsing.

Future Work

If a grammar has conflicts, a generalized parser can pursue all conflicting options in parallel until one of them yields a successful parse. This idea, which was originally used for generalized LR string parsing, has recently been transferred to HR grammars [18]. It turns out that generalized PSR parsing can be further extended to CHR grammars. This way, the CHR grammar for abstract meaning representations analyzed in Table 5 can be recognized by a generalized PSR parser generated with Grappa.

So far, the matching of a context node in a host graph depends on the existence of a matching host node. So a contextual rule may causally depend on another rule that generates the required node. Example 1 showed that certain graphs cannot be derived with a rightmost derivation. This complicates the parser, which always constructs rightmost derivations, since it has to check whether causal dependencies have been respected. Since we conjecture that causal dependencies do not really extend the generative power of CHR grammars, we will consider re-defining CHR grammars without causality. However, for the practical modeling of graph and diagram languages, it should be possible to express certain conditions that the host node of a context node should fulfill. For instance, one may wish to forbid that the application of a contextual rule introduces a cycle. This is why the initial version of CHR grammars introduced in [17] features contextual rules with application conditions that can express the existence or absence of certain paths in the host graph. We will investigate the ramifications of application conditions for parsing in the future.