
1 Introduction

Automata are recognizers used in various application domains, especially in computer science, e.g. to represent (not necessarily finite) languages or to solve the membership test, i.e. to decide whether a given element belongs to a language or not. Regular expressions are compact representations of these recognizers. Indeed, in the case where the elements are words, it is well known that each regular expression can be transformed into a finite state machine recognizing the language it defines. Several methods have been proposed to realize this conversion. As an example, Glushkov [6] (and independently McNaughton and Yamada [9]) showed how to construct a nondeterministic finite automaton with \(n+1\) states, where n is the number of occurrences of letters in a given regular expression. The main idea of the construction is to define particular sets, named \(\mathrm {First}\), \(\mathrm {Follow}\) and \(\mathrm {Last}\), that are computed with respect to the occurrences of the symbols that appear in the expression.

These so-called Glushkov automata (or position automata) are finite state machines that have been deeply studied. They have been structurally characterized by Caron and Ziadi [4], which allows the Glushkov computation to be inverted by constructing an expression with n symbols from a Glushkov automaton with \(n+1\) states. They also appear in the theoretical notion of one-unambiguity introduced by Brüggemann-Klein and Wood [3], characterizing the regular languages recognized by a deterministic Glushkov automaton, as well as in more practical concerns such as expression updating [2]. Finally, they are related to combinatorial research topics: as an example, Nicaud [12] proved that the average number of transitions of Glushkov automata is linear.

The Glushkov construction was extended to tree automata [8, 11], using a Top-Down interpretation of tree expressions. This interpretation can be problematic when determinism is considered. Indeed, it is folklore that there exist regular tree languages that cannot be recognized by deterministic Top-Down tree automata. Extensions of one-unambiguity are therefore incompatible with this approach.

In this paper, we propose a new approach based on the Glushkov construction in a Bottom-Up interpretation. We also define a compressed version of tree automata in order to factorize the transitions, and we show how to apply it directly to the Glushkov computation using the natural factorizations induced by the structure of the expressions. The paper is structured as follows: in Sect. 2, we recall some properties related to regular tree expressions and introduce some basic definitions. In Sect. 3, we define the position functions used for the construction of the Bottom-Up position tree automaton. Section 4 shows how to construct the Bottom-Up position tree automaton, with a linear number of states, using the functions defined in Sect. 3. In Sect. 5, we propose the notion of compressed automaton and show how to reduce the size of the position automaton computed in the previous section.

2 Preliminaries

Let us first introduce some notations and preliminary definitions. For a boolean condition \(\psi \), we denote by \((E \mid \psi )\) the set E if \(\psi \) is satisfied, and \(\emptyset \) otherwise. Let \(\varSigma =(\varSigma _n)_{n\ge 0}\) be a finite ranked alphabet. A tree t over \(\varSigma \) is inductively defined by \(t=f(t_1,\ldots ,t_k)\) where \(f\in \varSigma _k\) and \(t_1,\ldots ,t_k\) are k trees over \(\varSigma \). The relation “s is a subtree of t” is denoted by \(s\prec t\) for any two trees s and t. We denote by \(\mathrm {root}(t)\) the root symbol of the tree t, i.e.

$$\begin{aligned} \mathrm {root}(f(t_1,\ldots ,t_k))=f. \end{aligned}$$
(1)

The predecessors of a symbol f in a tree t are the symbols that appear directly above it. For a tree t and a symbol f, we denote by \(\mathrm {father}(t,f)\) the set of pairs

$$\begin{aligned} \mathrm {father}(t,f) = \{(g,i)\in \varSigma _l\times \mathbb {N} \mid \exists g(s_1,\ldots ,s_l)\prec t, \mathrm {root}(s_i)=f\}. \end{aligned}$$
(2)

These couples link the predecessors of f with the indices of the subtrees of t that f is the root of. Let us consider a tree \(t=g(t_1,\ldots ,t_k)\) and a symbol f. By definition of the structure of a tree, a predecessor of f in t is a predecessor of f in a subtree \(t_i\) of t, or g itself if f is the root of a subtree \(t_i\) of t. Consequently:

$$\begin{aligned} \mathrm {father}(t,f) = \bigcup _{i\le k} \mathrm {father}(t_i,f) \cup \{(g,i)\mid \mathrm {root}(t_i)=f\}. \end{aligned}$$
(3)

We denote by \(T_\varSigma \) the set of trees over \(\varSigma \). A tree language L is a subset of \(T_\varSigma \).
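To make these definitions concrete, here is a minimal Haskell sketch (our own encoding, not the authors' implementation, which is only mentioned in Sect. 6): trees over a ranked alphabet with symbols represented as strings, together with the \(\mathrm {root}\) and \(\mathrm {father}\) functions.

```haskell
import qualified Data.Set as Set

-- A tree is a symbol applied to a (possibly empty) list of subtrees.
data Tree = Node String [Tree] deriving (Eq, Ord, Show)

root :: Tree -> String
root (Node f _) = f

-- father t f collects the pairs (g, i) such that some subtree g(s_1,...,s_l)
-- of t satisfies root(s_i) = f, following Eq. (3) (children indexed from 1).
father :: Tree -> String -> Set.Set (String, Int)
father (Node g ts) f =
  Set.unions (map (`father` f) ts)
    `Set.union`
  Set.fromList [ (g, i) | (i, ti) <- zip [1 ..] ts, root ti == f ]
```

For instance, on the tree \(f(a,g(a))\), father returns the set \(\{(f,1),(g,1)\}\) for the symbol a.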

For any 0-ary symbol c, let \(t \cdot _c L\) denote the tree language consisting of the trees obtained by substituting every occurrence of the symbol c in t by a tree of L. This operation is extended to languages by \(L \cdot _c L' = \bigcup _{t\in L} t \cdot _c L'\). For an integer n, the n-th c-substitution of a language L is the language \(L^{c,n}\) recursively defined by

$$\begin{aligned} L^{c,n}&= {\left\{ \begin{array}{ll} \{c\}, &{} \text { if } n = 0,\\ L \cdot _c L^{c, n - 1} &{} \text { otherwise.} \end{array}\right. } \end{aligned}$$

Finally, for any tree language L, we denote by \(L^{*_c}\) the language \( \bigcup _{k\ge 0} L^{c,k}\).
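As a small worked example (ours, with \(f\in \varSigma _2\), \(g\in \varSigma _1\) and \(a,b,c\in \varSigma _0\), and assuming that every occurrence of c is replaced and that distinct occurrences may be replaced by distinct trees of L): for \(t=f(c,g(c))\) and \(L=\{a,b\}\),

$$\begin{aligned} t \cdot _c L = \{f(a,g(a)),\ f(a,g(b)),\ f(b,g(a)),\ f(b,g(b))\}. \end{aligned}$$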

An automaton over \(\varSigma \) is a 4-tuple \(\mathrm {A}=(\varSigma ,Q,Q_F,\delta )\) where Q is a set of states, \(Q_F\subseteq Q\) is the set of final states, and \(\delta \subset \bigcup _{k\ge 0} (Q^k\times \varSigma _k\times Q)\) is the set of transitions, which can be seen as a function from \(Q^k \times \varSigma _k\) to \(2^Q\) defined by

$$\begin{aligned} (q_1,\ldots ,q_k,f,q) \in \delta \Leftrightarrow q \in \delta (q_1,\ldots ,q_k,f). \end{aligned}$$

It can be linearly extended as the function from \((2^{Q})^k \times \varSigma _k\) to \(2^Q\) defined by

$$\begin{aligned} \delta (Q_1,\ldots ,Q_k,f) = \displaystyle \bigcup _{(q_1,\ldots ,q_k)\in Q_1\times \cdots \times Q_k} \delta (q_1,\ldots ,q_k,f). \end{aligned}$$
(4)

Finally, we also consider the function \(\varDelta \) from \(T_{\varSigma }\) to \(2^Q\) defined by

$$\begin{aligned} \varDelta (f(t_1,\ldots ,t_n)) = \delta (\varDelta (t_1),\ldots ,\varDelta (t_n),f). \end{aligned}$$

Using these definitions, the language recognized by the automaton A is the language \(\{t\in T_{\varSigma } \mid \varDelta (t)\cap Q_F \ne \emptyset \}\).
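A minimal Haskell sketch of this model (our own assumed representation, reusing the Tree type from the earlier sketch): the transition relation is stored as a list, and \(\delta \) and \(\varDelta \) are implemented directly from the two definitions above.

```haskell
import qualified Data.Set as Set

data Automaton q = Automaton
  { finals      :: Set.Set q
  , transitions :: [([q], String, q)]   -- a transition (q_1,...,q_k, f, q)
  }

-- delta extended to sets of states, as in Eq. (4).
delta :: Ord q => Automaton q -> [Set.Set q] -> String -> Set.Set q
delta aut qs f =
  Set.fromList
    [ q | (ps, g, q) <- transitions aut
        , g == f
        , length ps == length qs
        , and (zipWith Set.member ps qs) ]

-- Delta evaluates a tree bottom-up; a tree is recognized when Delta(t)
-- meets the set of final states.
evalDelta :: Ord q => Automaton q -> Tree -> Set.Set q
evalDelta aut (Node f ts) = delta aut (map (evalDelta aut) ts) f

accepts :: Ord q => Automaton q -> Tree -> Bool
accepts aut t = not (Set.null (evalDelta aut t `Set.intersection` finals aut))
```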

A regular expression E over the alphabet \(\varSigma \) is inductively defined by:

$$\begin{aligned} E&=f(E_1,\ldots ,E_k),&E&= E_1+E_2,\\ E&=E_1\cdot _c E_2,&E&=E_1^{*_c}, \end{aligned}$$

where \(k\in \mathbb {N}\), \(c\in \varSigma _0\), \(f\in \varSigma _k\) and \(E_1,\ldots ,E_k\) are any k regular expressions over \(\varSigma \). In what follows, we consider expressions where the subexpression \(E_1\cdot _c E_2\) only appears when c appears in the expression \(E_1\). The language denoted by E is the language L(E) inductively defined by

$$\begin{aligned} L(f(E_1,\ldots ,E_k))&= \{f(t_1,\ldots ,t_k)\mid t_j \in L(E_j), j\le k\},\\ L(E_1+E_2)&= L(E_1) \cup L(E_2),\\ L(E_1\cdot _c E_2)&= L(E_1) \cdot _c L(E_2),\\ L(E_1^{*_c})&= L(E_1) ^{*_c}, \end{aligned}$$

with \(k\in \mathbb {N}\), \(c\in \varSigma _0\), \(f\in \varSigma _k\) and \(E_1,\ldots ,E_k\) any k regular expressions over \(\varSigma \).
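As an illustration, here is one possible Haskell encoding of regular tree expressions (ours, with symbols again represented as strings), together with the linearized expression used later in Example 1.

```haskell
-- Regular tree expressions over a ranked alphabet.
data TExp
  = Sym  String [TExp]     -- f(E_1,...,E_k)
  | Plus TExp TExp         -- E_1 + E_2
  | Cat  TExp String TExp  -- E_1 ._c E_2
  | Star TExp String       -- E_1^{*_c}
  deriving Show

-- The linearized expression of Example 1: (f_1(a,a)+g_2(b))^{*_a} ._b f_3(g_4(a),b).
exprE :: TExp
exprE =
  Cat (Star (Plus (Sym "f1" [Sym "a" [], Sym "a" []])
                  (Sym "g2" [Sym "b" []]))
            "a")
      "b"
      (Sym "f3" [Sym "g4" [Sym "a" []], Sym "b" []])
```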

A regular expression E is linear if each symbol of \(\varSigma _{n}\) with \(n\ne 0\) occurs at most once in E. Note that the symbols of rank 0 may appear more than once. We denote by \(\overline{E}\) the linearized form of E, which is the expression E where any occurrence of a symbol is indexed by its position in the expression. The set of indexed symbols, called positions, is denoted by \(\mathrm {Pos}(\overline{E})\). We also consider the delinearization mapping \(\mathrm {h}\) sending a linearized expression to its original unindexed version.

Let \(\phi \) be a function between two alphabets \(\varSigma \) and \(\varSigma '\) such that \(\phi \) sends \(\varSigma _n\) to \(\varSigma '_n\) for any integer n. By a well-known adjunction, this function is extended to an alphabetical morphism from \(T(\varSigma )\) to \(T(\varSigma ')\) by setting \(\phi (f(t_1,\ldots ,t_n)) = \phi (f)(\phi (t_1),\ldots ,\phi (t_n))\). As an example, one can consider the delinearization morphism \(\mathrm {h}\) that sends an indexed alphabet to its unindexed version. Given a language L, we denote by \(\phi (L)\) the set \(\{\phi (t)\mid t\in L\}\). The image by \(\phi \) of an automaton \(A=(\varSigma ,Q,Q_F,\delta )\) is the automaton \(\phi (A)=(\varSigma ',Q,Q_F,\delta ')\) where

$$\begin{aligned} \delta ' = \{(q_1,\ldots ,q_n,\phi (f),q) \mid (q_1,\ldots ,q_n,f,q) \in \delta \}. \end{aligned}$$

By a trivial induction over the structure of the trees, it can be shown that

$$\begin{aligned} \phi (L(A)) = L(\phi (A)). \end{aligned}$$
(5)

3 Position Functions

In this section, we define the position functions that are considered in the construction of the Bottom-Up automaton in the next sections. We show how to compute them and how they characterize the trees in the language denoted by a given expression.

Let E be a linear expression over a ranked alphabet \(\varSigma \) and f be a symbol in \(\varSigma _k\). The set \(\mathrm {Root}(E)\), a subset of \(\varSigma \), contains the roots of the trees in L(E), i.e.

$$\begin{aligned} \mathrm {Root}(E) = \{\mathrm {root}(t) \mid t\in L(E)\}. \end{aligned}$$
(6)

The set \(\mathrm {Father}(E,f)\), a subset of \(\varSigma \times \mathbb {N}\), contains a couple (g, i) if there exists a tree in L(E) with a node labeled by g whose i-th child is a node labeled by f:

$$\begin{aligned} \mathrm {Father}(E,f)=\bigcup _{t\in L(E)} \mathrm {father}(t,f). \end{aligned}$$
(7)

Example 1

Let us consider the ranked alphabet defined by \(\varSigma _2=\{f\}\), \(\varSigma _1=\{g\}\), and \(\varSigma _0=\{a,b\}\). Let E and \(\overline{E}\) be the expressions defined by

$$\begin{aligned} E = (f(a,a)+g(b))^{*_a}\cdot _b f(g(a),b), \quad \overline{E} = (f_1(a,a)+g_2(b))^{*_a}\cdot _b f_3(g_4(a),b). \end{aligned}$$

Hence,

$$\begin{aligned} \mathrm {Root}(\overline{E})&= \{a, f_1, g_2\},\\ \mathrm {Father}(\overline{E},a)&= \{(f_1,1),(f_1,2),(g_4,1)\}, \quad \mathrm {Father}(\overline{E},b) = \{(f_3,2)\},\\ \mathrm {Father}(\overline{E},f_1)&= \{(f_1,1),(f_1,2)\}, \quad \mathrm {Father}(\overline{E},g_2) = \{(f_1,1),(f_1,2)\},\\ \mathrm {Father}(\overline{E},f_3)&= \{(g_2,1)\}, \quad \mathrm {Father}(\overline{E},g_4) = \{(f_3,1)\}. \end{aligned}$$

Let us show how to inductively compute these functions.

Lemma 1

Let E be a linear expression over a ranked alphabet \(\varSigma \). The set \(\mathrm {Root}(E)\) is inductively computed as follows:

$$\begin{aligned} \mathrm {Root}(f(E_1,...,E_n))&= \{f\},\\ \mathrm {Root}(E_1+E_2)&= \mathrm {Root}(E_1)\cup \mathrm {Root}(E_2),\\ \mathrm {Root}(E_1\cdot _c E_2)&= {\left\{ \begin{array}{ll} \mathrm {Root}(E_1)\setminus \{c\}\cup \mathrm {Root}(E_2)&{} {\textit{if}}\; c\in L(E_1),\\ \mathrm {Root}(E_1) &{} {\textit{otherwise},} \end{array}\right. }\\ \mathrm {Root}(E_1^{*_c})&= \mathrm {Root}(E_1)\cup \{c\}, \end{aligned}$$

where \(E_1,\ldots ,E_n\) are n regular expressions over \(\varSigma \), f is a symbol in \(\varSigma _n\) and c is a symbol in \(\varSigma _0\).
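A Haskell sketch of this computation over the TExp type introduced in Sect. 2 is given below. The auxiliary test hasLeaf e c, deciding whether the single-node tree c belongs to L(e), is not spelled out in the paper; its clauses are our own assumption, derived from the language definition.

```haskell
import qualified Data.Set as Set

-- hasLeaf e c: does the single-node tree c belong to L(e)? (our assumption)
hasLeaf :: TExp -> String -> Bool
hasLeaf (Sym f []) c    = f == c
hasLeaf (Sym _ _)  _    = False
hasLeaf (Plus e1 e2) c  = hasLeaf e1 c || hasLeaf e2 c
hasLeaf (Cat e1 d e2) c = (c /= d && hasLeaf e1 c) || (hasLeaf e1 d && hasLeaf e2 c)
hasLeaf (Star e1 d) c   = c == d || hasLeaf e1 c

-- Root(E), following Lemma 1.
rootSet :: TExp -> Set.Set String
rootSet (Sym f _)    = Set.singleton f
rootSet (Plus e1 e2) = rootSet e1 `Set.union` rootSet e2
rootSet (Cat e1 c e2)
  | hasLeaf e1 c = Set.delete c (rootSet e1) `Set.union` rootSet e2
  | otherwise    = rootSet e1
rootSet (Star e1 c)  = Set.insert c (rootSet e1)
```

On the expression exprE of the previous sketch, rootSet returns \(\{a, f_1, g_2\}\), as in Example 1.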

Lemma 2

Let E be a linear expression and f be a symbol in \(\varSigma _k\). The set \(\mathrm {Father}(E,f)\) is inductively computed as follows:

$$\begin{aligned} \mathrm {Father}(g(E_1,...,E_n),f)&= \bigcup _{i\le n} \mathrm {Father}(E_i,f) \cup \{(g,i)\mid f\in \mathrm {Root}(E_i)\},\\ \mathrm {Father}(E_1+E_2,f)&= \mathrm {Father}(E_1,f)\cup \mathrm {Father}(E_2,f),\\ \mathrm {Father}(E_1\cdot _c E_2,f)&= (\mathrm {Father}(E_1,f) \mid f \ne c) \cup \mathrm {Father}(E_2,f)\\&\quad \cup \, (\mathrm {Father}(E_1,c) \mid f\in \mathrm {Root}(E_2))\\ \mathrm {Father}(E_1^{*_c},f)&= \mathrm {Father}(E_1,f) \cup (\mathrm {Father}(E_1,c) \mid f\in \mathrm {Root}(E_1)), \end{aligned}$$

where \(E_1,\ldots ,E_n\) are n regular expressions over \(\varSigma \), g is a symbol in \(\varSigma _n\) and c is a symbol in \(\varSigma _0\).

Proof (partial)

Let us consider the following cases.

(1) :

Let us consider a tree t in \(t_1 \cdot _c L(E_2)\) with \(t_1\in L(E_1)\). By definition, t equals \(t_1\) where the occurrences of c have been replaced by some trees \(t_2\) in \(L(E_2)\). Two cases may occur. (a) If \(c \ne f\), then a predecessor of the symbol f in t can be a predecessor of the symbol f in a tree \(t_2\) in \(L(E_2)\), a predecessor of the symbol f in \(t_1\), or a predecessor of c in \(t_1\) if an occurrence of c in \(t_1\) has been replaced by a tree \(t_2\) in \(L(E_2)\) the root of which is f. (b) If \(c = f\), since the occurrences of c have been replaced by some trees \(t_2\) of \(L(E_2)\), a predecessor of the symbol c in t can be a predecessor of the symbol c in a tree \(t_2\) in \(L(E_2)\), or a predecessor of c in \(t_1\) if an occurrence of c has been replaced by itself (and therefore if c is in \(L(E_2)\)). In both cases, we conclude using Eqs. (3) and (6).

(2) :

By definition, \(L(E_1^{*_c}) = \bigcup _{k\ge 0} L(E_1)^{c,k}\). Therefore, a tree t in \(L(E_1^{*_c})\) is either c or a tree \(t_1\) in \(L(E_1)\) where the occurrences of c have been replaced by some trees \(t_2\) in \(L(E_1)^{c,k}\) for some integer k. Let us then proceed by recursion over k. If \(k = 1\), a predecessor of f in t is a predecessor of f in \(t_1\), a predecessor of f in a tree \(t_2\) in \(L(E_1)^{c,1}\) or a predecessor of c in \(t_1\) if an occurrence of c in \(t_1\) was substituted by a tree \(t_2\) in \(L(E_1)^{c,1}\) the root of which is f, i.e.

$$\begin{aligned} \mathrm {Father}(E_1^{c,2},f) = \mathrm {Father}(E_1,f) \cup (\mathrm {Father}(E_1,c) \mid f\in \mathrm {Root}(E_1)). \end{aligned}$$

By recursion over k and with the same reasoning, each recursion step adds \(\mathrm {Father}(E_1,f)\) to the result of the previous step, and therefore

$$\begin{aligned} \mathrm {Father}(E_1^{c,k},f) = \mathrm {Father}(E_1,f) \cup (\mathrm {Father}(E_1,c) \mid f\in \mathrm {Root}(E_1)). \end{aligned}$$

    \(\square \)
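Lemma 2 likewise translates directly into Haskell (a sketch reusing rootSet above; the guarded set \((E \mid \psi )\) of Sect. 2 becomes an if-then-else on sets).

```haskell
-- Father(E, f), following Lemma 2.
fatherSet :: TExp -> String -> Set.Set (String, Int)
fatherSet (Sym g es) f =
  Set.unions [ fatherSet ei f | ei <- es ]
    `Set.union`
  Set.fromList [ (g, i) | (i, ei) <- zip [1 ..] es, f `Set.member` rootSet ei ]
fatherSet (Plus e1 e2) f = fatherSet e1 f `Set.union` fatherSet e2 f
fatherSet (Cat e1 c e2) f =
  (if f /= c then fatherSet e1 f else Set.empty)
    `Set.union` fatherSet e2 f
    `Set.union` (if f `Set.member` rootSet e2 then fatherSet e1 c else Set.empty)
fatherSet (Star e1 c) f =
  fatherSet e1 f
    `Set.union` (if f `Set.member` rootSet e1 then fatherSet e1 c else Set.empty)
```

For instance, fatherSet exprE "a" evaluates to \(\{(f_1,1),(f_1,2),(g_4,1)\}\), in line with Example 1.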

Let us now show how these functions characterize, for a tree t, the membership of t in the language denoted by an expression.

Definition 1

Let E be a linear expression over a ranked alphabet \(\varSigma \) and t be a tree in \(T(\varSigma )\). The property P(t) is defined by

$$\begin{aligned} \forall s=f(t_1,\ldots ,t_n) \prec t, \forall i\le n, (f,i)\in \mathrm {Father}(E,\mathrm {root}(t_i)). \end{aligned}$$

Proposition 1

Let E be a linear expression over a ranked alphabet \(\varSigma \) and t be a tree in \(T(\varSigma )\). Then (1) t is in L(E) if and only if (2) \(\mathrm {root}(t)\) is in \(\mathrm {Root}(E)\) and P(t) is satisfied.

Proof (partial)

Let us first notice that the implication \(1 \Rightarrow 2\) follows directly from the definitions of \(\mathrm {Root}\) and \(\mathrm {Father}\). Let us show the converse implication by induction over the structure of E. Hence, let us suppose that \(\mathrm {root}(t)\) is in \(\mathrm {Root}(E)\) and that P(t) is satisfied.

(1) :

Let us consider the case when \(E=E_1\cdot _c E_2\). Let us first suppose that \(\mathrm {root}(t)\) is in \(\mathrm {Root}(E_2)\). Then c is in \(L(E_1)\) and P(t) is equivalent to

$$\begin{aligned} \forall s=f(t_1,\ldots ,t_n) \prec t, \forall i\le n, (f,i)\in \mathrm {Father}(E_2,\mathrm {root}(t_i)). \end{aligned}$$

By induction hypothesis t is in \(L(E_2)\) and therefore in L(E).

Let us suppose now that \(\mathrm {root}(t)\) is in \(\mathrm {Root}(E_1)\). Since E is linear, let us consider the subtrees \(t_2\) of t that contain only symbols of \(E_2\) and whose root has a predecessor in t that is a symbol of \(E_1\). Since P(t) holds, according to the induction hypothesis and Lemma 2, each of these trees belongs to \(L(E_2)\). Hence t belongs to \(t_1 \cdot _c \, L(E_2)\) where \(t_1\) is equal to t where the previously defined trees \(t_2\) are replaced by c. Once again, since P(t) holds and since \(\mathrm {root}(t)\) is in \(\mathrm {Root}(E_1)\), \(t_1\) belongs to \(L(E_1)\).

In these two cases, t belongs to L(E).

(2) :

Let us consider the case when \(E=E_1^{*_c}\). Let us proceed by induction over the structure of t. If \(t=c\), the proposition holds from Lemmas 1 and 2. Following Lemma 2, each predecessor of a symbol f in t is a predecessor of f in \(E_1\) (case 1) or a predecessor of c in \(E_1\) (case 2). If all the predecessors of the symbols satisfy case 1, then by induction hypothesis t belongs to \(L(E_1)\) and therefore to L(E). Otherwise, we can consider (similarly to the catenation product case) the smallest subtrees \(t_2\) of t the root of which admits a predecessor in t which is a predecessor of c in \(E_1\). By induction hypothesis, these trees belong to \(L(E_1)\). Consequently, t belongs to \(t' \cdot _c L(E_1)\) where \(t'\) is equal to t where the subtrees \(t_2\) have been substituted by c. Once again, by induction hypothesis, \(t'\) belongs to \(L(E_1^{*_c})\). As a direct consequence, t belongs to L(E).    \(\square \)
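Proposition 1 immediately yields an effective membership test for linear expressions; here is a short Haskell sketch, reusing the Tree, root, TExp, rootSet and fatherSet definitions from the earlier sketches.

```haskell
-- P(t): every node g(t_1,...,t_n) of t satisfies (g, i) in Father(E, root(t_i)).
propP :: TExp -> Tree -> Bool
propP e (Node g ts) =
  and [ (g, i) `Set.member` fatherSet e (root ti) | (i, ti) <- zip [1 ..] ts ]
    && all (propP e) ts

-- Membership test of Proposition 1 (E linear): t is in L(E) iff
-- root(t) is in Root(E) and P(t) holds.
memberL :: TExp -> Tree -> Bool
memberL e t = root t `Set.member` rootSet e && propP e t
```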

4 Bottom-Up Position Automaton

In this section, we show how to compute a Bottom-Up automaton with a linear number of states from the position functions previously defined.

Definition 2

The Bottom-Up position automaton \(\mathcal {P}_{E}\) of a linear expression E over a ranked alphabet \(\varSigma \) is the automaton \((\varSigma ,\mathrm {Pos}(E),\mathrm {Root}(E),\delta )\) defined by:

$$\begin{aligned} ((f_1,\ldots ,f_n),g,g) \in \delta \Leftrightarrow \forall i \le n, (g,i)\in \mathrm {Father}(E,f_i). \end{aligned}$$
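This membership condition for transitions is straightforward to express in Haskell (a sketch reusing fatherSet from Sect. 3):

```haskell
-- ((f_1,...,f_n), g, g) is a transition of the Bottom-Up position automaton
-- of E iff (g, i) belongs to Father(E, f_i) for every i (Definition 2).
isTransition :: TExp -> [String] -> String -> Bool
isTransition e fs g =
  and [ (g, i) `Set.member` fatherSet e fi | (i, fi) <- zip [1 ..] fs ]
```

For instance, isTransition exprE ["a","g2"] "f1" holds, matching the transition \(((a,g_2),f_1,f_1)\) listed in Example 2 below.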

Example 2

The Bottom-Up position automaton \((\mathrm {Pos}(\overline{E}),\mathrm {Pos}(\overline{E}),\mathrm {Root}(\overline{E}),\delta )\) of the expression \(\overline{E}\) defined in Example 1 is defined as follows:

$$\begin{aligned} \mathrm {Pos}(\overline{E}) = \{a,b,f_1,g_2,f_3,g_4\}, \quad \mathrm {Root}(\overline{E}) = \{a,f_1,g_2\}, \end{aligned}$$
$$\begin{aligned} \delta =&\{(a,a), (b,b), ((a,a),f_1,f_1), ((a,f_1),f_1,f_1), ((a,g_2),f_1,f_1), ((f_1,a),f_1,f_1),\\&\;\; ((f_1,f_1),f_1,f_1), ((f_1,g_2),f_1,f_1), ((g_2,a),f_1,f_1), ((g_2,f_1),f_1,f_1),\\&\;\; ((g_2,g_2),f_1,f_1), (f_3, g_2,g_2),((g_4,b),f_3,f_3), (a,g_4,g_4)\} \end{aligned}$$

Let us now show that the position automaton of E recognizes L(E).

Lemma 3

Let \(\mathcal {P}_E=(\varSigma ,Q,Q_F,\delta )\) be the Bottom-Up position automaton of a linear expression E over a ranked alphabet \(\varSigma \), t be a tree in \(T_\varSigma \) and f be a symbol in \(\mathrm {Pos}(E)\). Then (1) \(f\in \varDelta (t)\) if and only if (2) \(\mathrm {root}(t) = f \wedge P(t)\).

Proof

Let us proceed by induction over the structure of \(t=f(t_1,\ldots ,t_n)\). By definition, \(\varDelta (t) = \delta (\varDelta (t_1),\ldots ,\varDelta (t_n),f)\). For any state \(f_i\) in \(\varDelta (t_i)\), it holds from the induction hypothesis that

$$\begin{aligned} f_i\in \varDelta (t_i) \Leftrightarrow \mathrm {root}(t_i) = f_i \wedge P(t_i). \end{aligned}$$

Then, suppose that (1) holds (i.e. \(f\in \varDelta (t)\)). Equivalently, by definition of \(\mathcal {P}_E\), there exists a transition \(((f_1,\ldots ,f_n),f,f)\) in \(\delta \) such that \(f_i\) is in \(\varDelta (t_i)\) for any integer \(i\le n\). Consequently, f is the root of t. Moreover, from the equivalence stated above, \(\mathrm {root}(t_i) = f_i \) and \(P(t_i)\) hold for any integer \(i\le n\). Finally and equivalently, P(t) holds as a consequence of Eq. (3). The converse implication can be proved similarly, since only equivalences are considered.     \(\square \)

As a direct consequence of Lemma 3 and Proposition 1,

Proposition 2

The Bottom-Up position automaton of a linear expression E recognizes L(E).

The Bottom-Up position automaton of a (not necessarily linear) expression E can be obtained by first computing the Bottom-Up position automaton of its linearized expression \(\overline{E}\) and then applying the alphabetical morphism \(\mathrm {h}\). As a direct consequence of Eq. (5),

Proposition 3

The Bottom-Up position automaton of an expression E recognizes L(E).

5 Compressed Bottom-Up Position Automaton

In this section, we show that the structure of an expression allows us to factorize the transitions of a tree automaton by only considering the values of the \(\mathrm {Father}\) function. The basic idea of the factorization is to consider Cartesian products of sets. Imagine that a tree automaton contains the four binary transitions \((q_1,q_1,f,q_3)\), \((q_1,q_2,f,q_3)\), \((q_2,q_1,f,q_3)\) and \((q_2,q_2,f,q_3)\). These four transitions can be factorized into a single compressed transition \((\{q_1,q_2\},\{q_1,q_2\},f,q_3)\) using sets of states instead of states. The behavior of the original automaton can be simulated by considering the Cartesian product of the origin sets of the transition.

We first show how to encode such a notion of compressed automaton and how it can be used in order to solve the membership test.

Definition 3

A compressed tree automaton over a ranked alphabet \(\varSigma \) is a 4-tuple \((\varSigma ,Q,Q_F,\delta )\) where Q is a set of states, \(Q_F\subset Q\) is the set of final states, and \(\delta \subset \bigcup _{k\ge 0}((2^Q)^k\times \varSigma _k\times Q)\) is the set of compressed transitions, which can be seen as a function from \((2^{Q})^k\times \varSigma _k\) to \(2^Q\) defined by

$$\begin{aligned} (Q_1,\ldots ,Q_k,f,q)\in \delta \Leftrightarrow q\in \delta (Q_1,\ldots ,Q_k,f). \end{aligned}$$

Example 3

Let us consider the compressed automaton \(A=(\varSigma ,Q,Q_F,\delta )\) shown in Fig. 1. Its transitions are

$$\begin{aligned} \delta = \{&(\{1,2,5\},\{3,4\},f,1), (\{2,3,5\},\{4,6\},f,2), (\{1,2\},\{3\},f,5),\\&(\{6\},g,4), (\{6\},g,5), (a,6), (a,4), (b,3)\}. \end{aligned}$$
Fig. 1. The compressed automaton A.

The transition function \(\delta \) can be restricted to a function from \(Q^k\times \varSigma _k\) to \(2^Q\) (e.g. in order to simulate the behavior of an uncompressed automaton) by considering, for a tuple \((q_1,\ldots ,q_k)\) of states and a symbol f in \(\varSigma _k\), all the “active” transitions \((Q_1,\ldots ,Q_k,f,q)\), i.e. the transitions such that \(q_i\) is in \(Q_i\) for every \(i\le k\). More formally, for any states \((q_1,\ldots ,q_k)\) in \(Q^k\) and for any symbol f in \(\varSigma _k\),

$$\begin{aligned} \delta (q_1,\ldots ,q_k,f) = \bigcup _{\begin{array}{c} (Q_1,\ldots , Q_k, f,q)\in \delta ,\\ \forall i\le k, q_i\in Q_i \end{array}} \{q\}. \end{aligned}$$
(8)

The transition set \(\delta \) can be extended to a function \(\varDelta \) from \(T(\varSigma )\) to \(2^Q\) by inductively considering, for a tree \(f(t_1,\ldots ,t_k)\), the “active” transitions \((Q_1,\ldots ,Q_k,f,q)\) once the subtrees are read, that is the transitions such that \(\varDelta (t_i)\) and \(Q_i\) admit a common state for every \(i\le k\). More formally, for any tree \(t=f(t_1,\ldots ,t_k)\) in \(T(\varSigma )\),

$$\begin{aligned} \varDelta (t) = \bigcup _{\begin{array}{c} (Q_1,\ldots ,Q_k,f,q)\in \delta ,\\ \forall i\le k, \varDelta (t_i)\cap Q_i\ne \emptyset \end{array}} \{q\}. \end{aligned}$$

As a direct consequence of the two previous equations,

$$\begin{aligned} \varDelta (f(t_1,\ldots ,t_n)) = \bigcup _{(q_1,\ldots ,q_n)\in \varDelta (t_1)\times \cdots \times \varDelta (t_n)} \delta (q_1,\ldots ,q_n,f). \end{aligned}$$
(9)

The language recognized by a compressed automaton \(A=(\varSigma ,Q,Q_F,\delta )\) is the subset L(A) of \(T(\varSigma )\) defined by

$$\begin{aligned} L(A) = \{t\in T(\varSigma ) \mid \varDelta (t)\cap Q_F\ne \emptyset \}. \end{aligned}$$
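A minimal Haskell sketch of compressed automata and of their evaluation function (our own assumed representation, reusing the Tree type from Sect. 2):

```haskell
import qualified Data.Set as Set

data CAutomaton q = CAutomaton
  { cFinals :: Set.Set q
  , cTrans  :: [([Set.Set q], String, q)]   -- a compressed transition (Q_1,...,Q_k, f, q)
  }

-- A compressed transition fires on f(t_1,...,t_k) when every origin set Q_i
-- shares at least one state with Delta(t_i).
cEval :: Ord q => CAutomaton q -> Tree -> Set.Set q
cEval aut (Node f ts) =
  let rs = map (cEval aut) ts
  in  Set.fromList
        [ q | (qs, g, q) <- cTrans aut
            , g == f
            , length qs == length rs
            , and [ not (Set.null (qi `Set.intersection` ri)) | (qi, ri) <- zip qs rs ] ]

cAccepts :: Ord q => CAutomaton q -> Tree -> Bool
cAccepts aut t = not (Set.null (cEval aut t `Set.intersection` cFinals aut))
```

On the automaton of Fig. 1 and the tree \(f(f(b,a),g(a))\), cEval reproduces the computation detailed in Example 4 below.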

Example 4

Let us consider the automaton of Fig. 1 and let us show that the tree \(t=f(f(b,a),g(a))\) belongs to L(A). In order to do so, let us compute \(\varDelta (t')\) for each subtree \(t'\) of t. First, by definition,

$$\begin{aligned} \varDelta (a)&=\{4,6\}, \qquad&\varDelta (b)&=\{3\}. \end{aligned}$$

Since the only transition in \(\delta \) labeled by f containing 3 in its first origin set and 4 or 6 in its second is the transition \((\{2,3,5\},\{4,6\},f,2)\),

$$\begin{aligned} \varDelta (f(b,a)) = \{2\}. \end{aligned}$$

Since the two transitions labeled by g are \((\{6\},g,4)\) and \((\{6\},g,5)\),

$$\begin{aligned} \varDelta (g(a)) = \{4,5\}. \end{aligned}$$

Finally, there are two transitions labeled by f containing 2 in their first origin set and 4 or 5 in their second: \((\{2,3,5\},\{4,6\},f,2)\) and \((\{1,2,5\},\{3,4\},f,1)\). Therefore

$$\begin{aligned} \varDelta (f(f(b,a),g(a))) = \{1,2\}. \end{aligned}$$

Finally, since 1 is a final state, \(t\in L(A)\).

Let \(\phi \) be an alphabetical morphism between two alphabets \(\varSigma \) and \(\varSigma '\). The image by \(\phi \) of a compressed automaton \(A=(\varSigma ,Q,Q_F,\delta )\) is the compressed automaton \(\phi (A)=(\varSigma ',Q,Q_F,\delta ')\) where

$$\begin{aligned} \delta '=\{(Q_1,\ldots ,Q_n,\phi (f),q)\mid (Q_1,\ldots ,Q_n,f,q) \in \delta \}. \end{aligned}$$

By a trivial induction over the structure of the trees, it can be shown that

$$\begin{aligned} L(\phi (A)) = \phi (L(A)). \end{aligned}$$
(10)

Due to their inductive structure, regular expressions naturally factorize the transitions of a Glushkov automaton. Let us now define the compressed position automaton of an expression.

Definition 4

The compressed Bottom-Up position automaton \(\mathcal {C}(E)\) of a linear expression E is the automaton \((\varSigma ,\mathrm {Pos}(E),\mathrm {Root}(E),\delta )\) defined by

$$\begin{aligned} \delta = \{(Q_1,\ldots ,Q_k,f,f) \mid f\in \mathrm {Pos}(E)\cap \varSigma _k,\ \forall i\le k,\ Q_i = \{g \mid (f,i) \in \mathrm {Father}(E,g)\} \}. \end{aligned}$$
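In Haskell, the compressed transition attached to a position can be computed directly from \(\mathrm {Father}\) (a sketch reusing TExp and fatherSet; the helper allPositions, which collects the symbols occurring in the expression, is ours):

```haskell
-- All symbols occurring in a (linearized) expression, i.e. Pos(E).
allPositions :: TExp -> [String]
allPositions (Sym f es)    = f : concatMap allPositions es
allPositions (Plus e1 e2)  = allPositions e1 ++ allPositions e2
allPositions (Cat e1 _ e2) = allPositions e1 ++ allPositions e2
allPositions (Star e1 _)   = allPositions e1

-- The compressed transition of Definition 4 for a position f of arity k:
-- the i-th origin set collects the positions g with (f, i) in Father(E, g).
compressedTransition :: TExp -> String -> Int -> ([Set.Set String], String, String)
compressedTransition e f k =
  ( [ Set.fromList [ g | g <- allPositions e, (f, i) `Set.member` fatherSet e g ]
    | i <- [1 .. k] ]
  , f
  , f )
```

For instance, compressedTransition exprE "f1" 2 yields the origin sets \(\{a,f_1,g_2\}\) and \(\{a,f_1,g_2\}\), i.e. a single compressed transition on \(f_1\) that factorizes the nine transitions on \(f_1\) listed in Example 2.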

Example 5

Let us consider the expression \(\overline{E}\) defined in Example 1. Its compressed Bottom-Up position automaton is represented in Fig. 2.

Fig. 2. The compressed automaton of the expression \((f_1(a,a)+g_2(b))^{*_a}\cdot _b f_3(g_4(a),b)\).

As a direct consequence of Definition 4 and of Eq. (8),

Lemma 4

Let E be a linear expression over a ranked alphabet \(\varSigma \) and let \(\mathcal {C}(E) =(\varSigma ,Q,Q_F,\delta )\). Then, for any states \((q_1,\ldots ,q_n)\) in \(Q^n\) and any symbol f in \(\varSigma _n\),

$$\begin{aligned} \delta (q_1,\ldots ,q_n,f) = \{f\} \Leftrightarrow \forall i \le n, (f,i) \in \mathrm {Father}(E,q_i). \end{aligned}$$

Consequently, considering Definition 2, Lemma 4 and Eq. (9),

Proposition 4

Let E be a linear expression over a ranked alphabet \(\varSigma \). Let \(\mathcal {P}_E=(\_,\_,\_,\delta )\) and \(\mathcal {C}(E) =(\_,\_,\_,\delta ')\). For any tree t in \(T(\varSigma )\),

$$\begin{aligned} \varDelta (t) = \varDelta '(t). \end{aligned}$$

Since the Bottom-Up position automaton of a linear expression E and its compressed version have the same states and the same final states,

Corollary 1

The Bottom-Up position automaton of a linear expression and its compressed version recognize the same language.

The compressed Bottom-Up position automaton of a (not necessarily linear) expression E can be obtained by first computing the compressed Bottom-Up position automaton of its linearized expression \(\overline{E}\) and then by applying the alphabetical morphism \(\mathrm {h}\). Therefore, considering Eq. (10),

Proposition 5

The compressed Bottom-Up position automaton of a regular expression E recognizes L(E).

6 Web Application

The computation of the position functions and the Glushkov constructions have been implemented in a web application (written in Haskell, compiled into JavaScript using the Reflex platform, and rendered with viz.js) in order to help the reader manipulate these notions. From a regular expression, it computes the classical Top-Down Glushkov automaton defined in [8], and both the plain and the compressed versions of the Bottom-Up Glushkov automaton.

This web application is available at [10]. As an example, the expression \((f(a,a)+g(b))^{*_a}\cdot _bf(g(a),b)\) of Example 1 can be entered as the literal input (f(a,a)+g(b))*a.bf(g(a),b).

7 Conclusion and Perspectives

In this paper, we have shown how to compute the Bottom-Up position automaton associated with a regular expression. This construction closely parallels the classical one defined for word expressions [6]. We have also proposed a reduced version, the compressed Bottom-Up position automaton, which can easily be defined for word expressions too.

Since this construction is related to the classical one, one can wonder whether all the studies involving Glushkov word automata can be extended to tree automata ([2,3,4,12]). The classical Glushkov construction was also studied via its morphic links with other well-known constructions. The next step of our study is to extend Antimirov partial derivatives [1] in a Bottom-Up way too (in a different way from [7]), using the Bottom-Up quotient defined in [5].