1 Introduction

The present paper follows and refines an approach for XML schema inference from positive examples that was introduced by Bex et al. [4]. The basic problem setting is as follows. Given a set of XML documents, generate a schema that describes these documents, while being compact and preferably human readable.

Bex et al. approach this problem by learning deterministic regular expressions from positive examples; i. e., they consider the following problem: Given a finite set S of positive examples from an unknown target language L, find a deterministic regular expression for L. These regular expressions can immediately be used as element type declarations in DTDs (Document Type Definitions), and while XSDs (XML Schema Documents) require additional effort, algorithms that infer regular expressions can also be used as a component of XSD inference algorithms (see [4, 5] for further explanations). In particular, as argued in [4], the results in [16] show that XSD inference requires deep insights into regular expression inference – as Bex et al. put it, “one cannot hope to successfully infer XSDs without good algorithms for inferring regular expressions”.

Using a classical technique from Gold [12], Bex et al. prove in [3] that even the class of deterministic regular expressions is too rich to be learnable from positive data. While, strictly speaking, the learnability criterion of Gold-style learning as defined in [12] (which is also called learning in the limit from positive data or explanatory learning) is different from the setting in [3, 4],Footnote 1 its non-learnability results still provide valuable insights into necessary restrictions. In particular, Gold-style learning shows that, when learning from positive data, one has to balance the need for generalization (as in most cases, a regular expression that generates exactly the example is not considered a good hypothesis) with the need to avoid overgeneralization.

While there are numerous papers on restrictions on the class of regular languages that lead to learnability, apart from a few exceptions (e. g. [7]), most of these restrictions prior to [4] have been based on properties of automata. As explained in [4], this is problematic, as even under those restrictions, converting the inferred automaton to a regular expression can lead to an exponential size increase. In order to achieve learnability of concise deterministic regular expression, Bex et al. propose single-occurrence regular expressions (short Sores), regular expressions where each terminal letter (or element name) occurs at most once. These sores are deterministic by definition, and as an additional benefit, this restriction ensures that the length of the inferred expressions is at most linear in the number of different terminal letters.

The corresponding Sore inference algorithm RWR from [4] works as follows. First, it constructs a so-called single-occurrence automaton (short Soa, as introduced by García and Vidal [11]). RWR then attempts to convert the Soa step by step into a Sore. As the class of Sore languages is a proper subset of the class of Soa languages, this conversion is not always possible. In these cases, RWR attempts to repair the Soa, and constructs a Sore that generates a generalization of the language of the Soa. In order to generalize as little as possible, [4] suggests different orderings on the set of repair rules, as well as the variant RWR \(^{2}_{\ell }\), which uses additional heuristics and can have an exponential running time. Nonetheless, these variants may still infer Sores that are not inclusion-minimal generalizations of the input sample (within the class of all Sores).

In order to deal with insufficient data, Bex et al. propose a further restriction on Sores, the so-called chain regular expressions (short: Chares), and introduce the corresponding inference algorithm CRX. Analogously to RWR, CRX may infer Chares that are not inclusion-minimal generalizations.

The present paper focuses on inferring Sores and Chares that are inclusion-minimal generalizations. This approach to regular expression inference is based on a slightly different angle than Gold-style learning, namely on the learning paradigm of descriptive generalization that was introduced by Freydenberger and Reidenbach [10].

While Gold-style learning usually assumes that an exact representation of the target language is present in the hypothesis space, and that the learner is provided with sufficient positive information to correctly recognize the target language, descriptive generalization views the hypothesis space and the space of target languages as distinct.

For a class \(\mathcal {D}\) of language representation mechanisms (e. g., a class of automata, regular expressions, or grammars), a language representation \(\delta \in \mathcal {D}\) is called \(\mathcal {D}\)-descriptive of a sample S if its language \(\mathcal {L}(\delta )\) is an inclusion-minimal generalization of S, i. e., \(S \subseteq \mathcal {L}(\delta )\) and there is no \(\gamma \in \mathcal {D}\) with \(S \subseteq \mathcal {L}(\gamma ) \subset \mathcal {L}(\delta )\). To the authors’ knowledge, the first class \(\mathcal {D}\) for which the existence of descriptive representations was examined is the class of NE-patterns, where Angluin [1] introduces the notion of descriptive patterns in the context of exact learning from positive data (see [19] for a survey on the influence of pattern languages in this area). While the other mentioned example mechanisms are probably more familiar to many readers, most research on descriptive representations has focused on various classes of pattern languages (see [9, 10] for some examples).

This concept allows us to define \(\mathcal {D}\)-descriptive generalization as a natural extension of Gold-style learning: Instead of attempting to learn an exact representation of the target language L from a sample S, the learner has to infer a representation \(\delta \in \mathcal {D}\) that is \(\mathcal {D}\)-descriptive of L. In other words, δ is a generalization of S that is as inclusion-minimal as possible within \(\mathcal {D}\).

Descriptive generalization explicitly separates the hypothesis space from the class of target languages, while still providing a natural quality criterion for generalization from positive examples. In the present paper, we consider the class of Sores and the class of Chares as hypothesis spaces \(\mathcal {D}\), and examine the problem of inferring \(\mathcal {D}\)-descriptive generalizations from finite samples.

As in [4], we approach this problem by first computing a Soa-descriptive Soa. As we shall see, this approach has the advantages that the descriptive Soa is uniquely defined, can be computed efficiently, and its language is included in the language of every descriptive Sore or Chare.

The main contribution of the present paper are two algorithms, Soa2Sore and Soa2Chare, that can be used to transform any given Soa into a Sore (resp. Chare) that is Sore-descriptive (resp. Chare-descriptive) of the language of that Soa. That is, given a sample S, these algorithms can be used to compute a generalization of S that is inclusion-minimal (or, in the terminology of [4], optimal) within the class of Sores or Chares (respectively).

In addition to this, Soa2Chare and Soa2Sore are efficient: Soa2Chare runs in time O(m) (compared to O(m + n 3) for CRX), Soa2Sore in time O(n m) (compared to O(n 5) for RWR), where m is the number of edges and n the number of vertices in the Soa.

The paper is structured as follows. Section 2 contains some mathematical preliminaries, followed by some informative properties of the language classes considered. Section 3 discusses CRX as well as RWR and its variants in the context of descriptive regular expressions. In particular, we show that for each of these algorithms, there are samples over small alphabets where the algorithm does not compute a descriptive Chare or Sore. Sections 4 and 5 contain the algorithms Soa2Chare and Soa2Sore, respectively, as well as proofs of their correctness and running time. In Section 6, we discuss the use of these algorithms for less restricted language classes, while Section 7 contains some example DTDs that were generated by a prototype implementation of the algorithms. Finally, Section 8 concludes the paper.

A preliminary version of this article appeared as [8]. Apart from some minor changes, the present version was improved as follows.

  • All proofs that were omitted from [8] have been included.

  • A mistake in the algorithm for finding descriptive Sores (more precisely: in its subroutine “bend”) was fixed.

  • Sections 6 and 7 were added.

2 Preliminaries

Let ∅ denote the empty set and let ε denote the empty word. With |x|, we denote the length of x if x is a word, or the number of elements in x if x is a set. We use ⊆ (and ⊂) to denote the inclusion (respectively proper inclusion) of sets. The difference of two sets A, B is denoted by AB and defined as {aA|aB}. A word v is a factor of a word x ∈ Σ if there exist u, w ∈ Σ such that x = u v w. A digram is a factor of length 2. Let term(w) denote the set of all letters occurring in a word w, and extend this to languages by defining \(\text {term}(L):= \bigcup _{w\in L}\text {term}(w)\).

The concatenation operator is extended to languages by defining L 1L 2:={w 1w 2w 1L 1, w 2L 2} for languages L 1 and L 2. For every language L, we define L 0:={ε}, L n + 1: = LL n for all n ≥ 0, and \(L^{+}:= \bigcup _{n\geq 1}L^{n}\).

2.1 Introducing SORE, CHARE, SOA

This section introduces the classes of regular expressions and automata used in this paper. We mostly follow the notations of [4].

Definition 1

Let Σ be a finite alphabet (the set of terminal letters, also called element names). Every letter a ∈ Σ is a regular expression and defines the language \(\mathcal {L}(a):=\{a\}\). Furthermore, ε and ∅ are regular expressions, with languages \(\mathcal {L}({\varepsilon }):=\{{\varepsilon }\}\) and \(\mathcal {L}(\emptyset ):=\emptyset \). If α is a regular expression, then α + and α? are regular expressions, where \(\mathcal {L}(\alpha ^{+}):=(\mathcal {L}(\alpha ))^{+}\) and \(\mathcal {L}(\alpha ?):=\mathcal {L}(\alpha )\cup \{{\varepsilon }\}\). Furthermore, if α and β are regular expressions, then α | β and αβ are also regular expressions, with \(\mathcal {L}(\alpha \, | \, \beta ):=\mathcal {L}(\alpha )\cup \mathcal {L}(\beta )\) and \(\mathcal {L}(\alpha \cdot \beta ):=\{uv\, | \,u\in \mathcal {L}(\alpha ), v\in \mathcal {L}(\beta )\}\).

For sake of convenience, we sometimes omit the concatenation operator (i. e., we write α β instead of αβ), and add or omit parentheses. For a regular expression α, we use t e r m(α) to denote the set of terminal letters that occur in α. We call two regular expressions α, β alphabet-disjoint if t e r m(α)∩t e r m(β) = ∅. Two regular expressions α and β are equivalent if \(\mathcal {L}(\alpha )=\mathcal {L}(\beta )\). For any set A = {a 1,…, a n }⊆Σ (n ≥ 1), we use the notation A L T(A) to denote the regular expression A L T(A):=(a 1 | ⋯ | a n ), with A L T(∅) = ε (ALT stands for alternation). In a strict sense, this definition requires an ordering on the letters to be sound, but for the purpose of this paper, this is of no concern, and we assume that A L T(A) = A L T(B) if A = B.

The full class of regular expressions is too strong both for DTDs (which allow only deterministic regular expressions) and for learning from positive data (which requires language classes that are sufficiently sparse, cf. [12]). As proven in [3], even the class of deterministic regular expressions is still too large to be learnable from positive data. Hence, [4] proposes the following subclasses of deterministic regular expressions.

Definition 2 (Sore / Chare)

Let Σ be a finite alphabet. A single-occurrence regular expression (or Sore) is a regular expression over Σ in which each terminal letter occurs (at most) once.

A chain regular expression (or Chare) is a Sore over Σ of the form f 1⋅…⋅f n (n ≥ 0), where each f i is a chain factor, i. e., a Sore of the form (a 1 | ⋯ | a k ), (a 1 | ⋯ | a k )?, (a 1 | ⋯ | a k )+, or (a 1 | ⋯ | a k )+?, where k ≥ 1, and each a j is a terminal letter.

A language L is called a Sore language (or a Chare language) if there exists a Sore (or a Chare) α with \(\mathcal {L}(\alpha )=L\).

In other words, a Chare consists of a concatenation of alphabet-disjoint chain factors. We illustrate these definitions with a few short examples.

Example 3

Consider the regular expressions α:=(a)?(b | c)+, β:=(a b)+, and γ: = a b a a. Here, α is a Chare (and, hence, also a Sore), as it consists of two alphabet-disjoint chain factors.

On the other hand, β is a Sore (every letter occurs only once), but not a Chare (as it is not composed of chain factors). In fact, not only is β not a Chare, one can also prove that \(\mathcal {L}(\beta )\) is not a Chare language.Footnote 2

Finally, γ is not a Sore (and therefore not a Chare), and one can prove that \(\mathcal {L}(\gamma )\) is not a Sore language.Footnote 3

While the focus of this paper is on learning regular expressions, most of our technical reasoning uses the following class of automata.

Definition 4 (Soa)

Let Σ be a finite alphabet, and let snk, src be distinct symbols that do not occur in Σ. A single-occurrence automaton (short: SOA) over Σ is a finite directed graph A = (V, E) such that

  1. (1)

    {src, snk} ∈V, and V⊆Σ∪{src, snk},

  2. (2)

    src has only outgoing edges, snk has only incoming edges, and every vV lies on a path from src to snk.

We call t e r m(A): = V∖ {src, snk} the set of terminal letters in A. We define the relation → A on V by → A : = E, and use → A and → A to denote the transitive and reflexive-transitive hull of → A . The language \(\mathcal {L}(A)\) that is accepted by A is the set of all words w = a 1a n (n ≥ 0) such that src A a 1 A ⋯→ A a n A snk. A language L is called a Soa language if there exists a Soa A \(\mathcal {L}(A)=L\). Two Soas A and B are equivalent if \(\mathcal {L}(A)=\mathcal {L}(B)\).

In order to ease understanding, we use specific language for vertices v in a Soa depending on the context: If the context is that of automata, we refer to it as a state; in the context of graph operations as a vertex.

A strongly connected component of a Soa A is a non-empty and inclusion-maximal set C of vertices of A such that for all a, bC, \(a{\rightarrow ^{*}_{A}} b\) and \(b{\rightarrow ^{*}_{A}} a\) holds. A strongly connected looped component of a Soa A is a non-empty and inclusion-maximal set C of vertices of A such that for all a, bC, \(a{\rightarrow ^{+}_{A}} b\) and \(b{\rightarrow ^{+}_{A}} a\) holds. In other words, if \({\rightarrow ^{+}_{A}}\) is interpreted as the reachability relation in A, every strongly connected looped component contains exactly those vertices that are mutually reachable. Thus, a strongly connected component may be a singleton without an edge, while a singleton strongly connected looped component must have a self-loop. By definition, all strongly connected looped components of a Soa are disjoint, and src and snk cannot be part of any strongly connected looped component.

Although their definition is somewhat different, it is easy to see that Soas are a subclass of DFAs. In particular, a Soa can be converted into a DFA by labeling every edge with the state that it points to. Further, every state that has an edge to the sink is made a final state. This is illustrated by the following example.

Example 5

In the picture below, we have a Soa on the left side, and the corresponding DFA to the right side. Note that, for depicting Soas, we use the symbol “ ∙” as a symbol for the source, and “” as a symbol for the sink.

figure b

Both automata generate the same language as the regular expression α:=((a c +?b)((a c +?b) | (c + b))+?)?.

In this paper, we frequently use Soas to approximate languages. For this, we rely on the following definition.

Definition 6

For every w ∈ Σ, let f i r s t(w) and l a s t(w) denote the first resp. last letter of w, and let g r a m 2(w) be the set of all digrams in w. We extend these functions on words to functions on languages by defining f i r s t(L):={f i r s t(w) | wL}, l a s t(L):={l a s t(w) | wL}, and \({gram_{2}}(L):= \bigcup _{w\in L}{gram_{2}}(w)\).

For every language L ⊆ Σ, we define the Soa approximation of L, SOA(L), by SOA(L) : =(V L , E L ), where V L : = term(L)∪ {src, snk}, and E L contains the edges

  • (src, a) for every af i r s t(L),

  • (a, snk) for every al a s t(L),

  • (a, b) for all a, b ∈ Σ with a bg r a m 2(L),

  • (src, snk) if εL.

Using this terminology, the approach for Soa learning presented in [11] can be summarized as follows. Given a finite set S, compute SOA(S). In [4], the resulting algorithm is called 2T-INF. The following observation follows immediately from Definition 6.

Corollary 7

For every language L, computing SOA(L) is only as hard as computing first(L), last(L), and gram 2(L).

Hence, SOA(L) can be constructed for languages from classes that are larger than the classes of finite or regular languages, e. g., for context-free languages; see Section 6.2 for a detailed discussion.

It is easy to see from the definition that \(\mathcal {L}(\text {SOA(L)})\supseteq L\) holds for every language L (in fact, we shall see in Proposition 15 that \(\mathcal {L}(\text {SOA(L)})\) is always the least general approximation of L that is possible with a Soa language). This inclusion can be proper as follows.

Remark 8

Note that even for finite languages L, the equality \(\mathcal {L}(\text {SOA(L)})=L\) is not necessary; e. g., consider L = {a b a a}. Then SOA(L) contains an edge from src to a, from a to b, from b to a, from a to itself, and from a to snk. Hence, \(aa\in \mathcal {L}(\text {SOA(L)})\), while a aL.

There are Soa languages that are not Sore languages. One example is the language \(\mathcal {L}(\alpha )\) from Example 5, but proving this using only techniques that have been introduced at this point requires considerable effort.Footnote 4 On the other hand, we have that every Sore language is a Soa language (in other words, the Soa approximation of a Sore language is exact).

Lemma 9 ([4], proof of Proposition 9)

Given any Sore α, we have \(\mathcal {L}(\text {SOA}(\mathcal {L}(\alpha )))=\mathcal {L}(\alpha )\).

Moreover, according to Corollary 7, SOA\((\mathcal {L}(\alpha ))\) can be derived directly from every Sore α.

Lemma 9 allows us to define SOA (α) as a notational shorthand for \(\text {SOA}(\mathcal {L}(\alpha ))\). Similarly, we use → α to denote the relation →SOA(α).

More importantly, we shall use Lemma 9 to develop a handy syntactic characterization of the inclusion for Sores (and Chares), which is based on the inclusion of Soas. We say that a Soa A covers a Soa B if A is a supergraph of B – in other words, t e r m(A)⊇t e r m(B) holds, and a B b implies a A b for all a, bt e r m(B). This definition leads to the following characterization of Soa inclusion.

Lemma 10 ([11], Theorem 3.1)

For every pair A,B of Soa s, \(\mathcal {L}(A)\subseteq \mathcal {L}(B)\) if and only if A is covered by B.

Although Lemma 10 is stated in [11] without proof (the authors cite García’s PhD thesis), it is easily proven considering the definition of SOA(L).

Combining Lemma 10 with Lemma 9, we are able to characterize inclusion of Sores as follows.

Lemma 11

For every pair α,β of Sore s, \(\mathcal {L}(\alpha )\subseteq \mathcal {L}(\beta )\) if and only if SOA (α) is covered by SOA(β).

This obviously implies that two Sores (or Chares) are equivalent if their corresponding Soas are equivalent. More importantly, Lemma 11 provides a simple syntactic and characteristic criterion for inclusion. While the algorithms in Sections 4 and 5 do not check for inclusion, their correctness proofs make heavy use of the fact that Sore inclusion depends on the presence of edges in the corresponding Soa. Before we introduce the other central definition of this paper in Section 2.2, we discuss some concepts which will be useful.

One can verify with little effort that the classes of Soa-, Sore-, or Chare languages are not closed under many of the operations that are commonly studied in formal language theory (e. g., concatenation, union, complementation, intersection with regular languages, morphism, inverse morphism). One of the few operations under which all these classes is closed is projection. Let Σ be an alphabet. A projection from Σ to T ⊆ Σ is a morphism π T : ΣT that is defined by π T (x): = x for all xT, and π T (x): = ε for all x ∈ Σ∖T. We extended this to languages canonically, i. e., π T (L):={π T (w) | wL}.

Lemma 12

The classes of Sore-, Chare-, and Soa languages are closed under projection.

Proof

Let T ⊆ Σ. Regarding Sores, consider an arbitrary Sore α over Σ. For every letter at e r m(α) with aT, replace a with ε, and call the resulting expression α . As α was a Sore, α is a Sore as well, and it is easily seen that \(\mathcal {L}(\alpha ^{\prime })={\pi _{T}}(\mathcal {L}(\alpha ))\). Hence, \({\pi _{T}}(\mathcal {L}(\alpha ))\) is a Sore language.

Regarding Chares, consider an arbitrary Chare β over Σ. By definition, there exist pairwise alphabet-disjoint chain factors f 1,…, f n (n ≥ 0) with β = f 1⋅…⋅f n . We now define A i : = t e r m(f i ) and T i : = A i T for each 1 ≤ in. For each 1 ≤ in, we define chain factors \(f^{\prime }_{i}\) and \(f^{\prime \prime }_{i}\) in the following way: If T i = ∅, let \(f^{\prime }_{i}:= f^{\prime \prime }_{i}:= {\varepsilon }\). Otherwise, let

$$f^{\prime}_{i}:= \left\{\begin{array}{ll} ALT(T_{i}) &\text{if \(f_{i}=ALT(A_{i})\) or \(f_{i}=ALT(A_{i})?\),}\\ ALT(T_{i})^{+} &\text{if \(f_{i}=ALT(A_{i})^{+}\) or \(f_{i}=ALT(A_{i})^{+}?\).} \end{array}\right. $$

and

$$f^{\prime\prime}_{i}:= \left\{\begin{array}{ll} f^{\prime}_{i}? &\text{if} \;A_{i}\not\subseteq T \;\text{or}\; {\varepsilon}\in\mathcal{L}(f_{i}),\\ f^{\prime}_{i} &\text{if}\; A_{i}\subseteq T \;\text{and}\; {\varepsilon}\notin\mathcal{L}(f_{i}). \end{array}\right. $$

Finally, let \(\beta ^{\prime \prime }:= f^{\prime \prime }_{1}\cdot \ldots \cdot f^{\prime \prime }_{n}\), and remove all chain factors \(f^{\prime \prime }_{i}={\varepsilon }\) from β . Again, it is easy to see that \(\mathcal {L}(\beta ^{\prime \prime })={\pi _{T}}(\mathcal {L}(\beta ))\), which means that \({\pi _{T}}(\mathcal {L}(\beta ))\) is a Chare language.

Finally, regarding Soas, consider an arbitrary Soa A over Σ. We construct a Soa A from A by iteratively removing letters; i. e., in each step, a letter a ∈ (t e r m(A)∖T) and its associated edges are deleted, and for every pair of vertices u and v such that u A a and a A v holds, an edge (u, v) is added. Then \(\mathcal {L}(A^{\prime })={\pi _{T}}(\mathcal {L}(A))\); hence, \({\pi _{T}}(\mathcal {L}(A))\) is a Soa language. □

The main approach in the present paper (as well as in [4]) is converting Soas into Sores or Chares. During this process, it is occasionally convenient to work with a model that can be viewed as an intermediary step between a Soa and a regular expression.

Definition 13

Let Σ be a finite alphabet, and let snk, src be distinct symbols that do not occur in Σ. A generalized single-occurrence automaton (or generalized Soa) over Σ is a finite directed graph A = (V, E) such that

  1. (1)

    {src, snk} ⊆V, and all vertices in V∖ {src, snk} are pairwise alphabet-disjoint Sores; and

  2. (2)

    the edge relation E is such that src has only outgoing edges; snk has only incoming edges, and every vV lies on a path from src to snk.

The relations → A , \({\rightarrow ^{*}_{A}}\), \({\rightarrow ^{+}_{A}}\) on V are defined analogously to (non-generalized) Soa. We extend term to generalized Soas by defining \(term(A):= \bigcup _{v\in V\setminus \{{\mathtt {src}},~{\mathtt {snk}}\}} term(v)\).

The language \(\mathcal {L}(A)\) is defined to be the set of all wt e r m(A) for which there exist a n ≥ 0, vertices v 1,…, v n V∖{src, snk}, and words w 1,…, w n t e r m(A) such that src A v 1 A ⋯→ A v n A snk, w = w 1w n , and \(w_{i}\in \mathcal {L}(v_{i})\) holds for every 1 ≤ in.

Note that generalized Soas accept the same class of languages as Soas: As the Sores are required to be alphabet-disjoint, every generalized Soa can be transformed into a Soa by replacing each Sore with its Soa.

Just as for Soas, we again use specific language for vertices of generalized Soas depending on context. In addition to the words state and vertex, in the context of manipulating the Sore v (since every vertex is a Sore), we talk about the Sore as the label of v.

2.2 Descriptivity

This section introduces the notion of descriptive expressions and automata, which is one of the central aspects of the present paper.

Definition 14

Let \(\mathcal {D}\) be a class of regular expressions or finite automata over some alphabet Σ. A \(\delta \in \mathcal {D}\) is called \(\mathcal {D}\)-descriptive of a non-empty language S ⊆ Σ if \(\mathcal {L}(\delta )\supseteq S\), and there is no \(\gamma \in \mathcal {D}\) such that \(\mathcal {L}(\delta )\supset \mathcal {L}(\gamma )\supseteq S\).

In other words, an expression or automaton that is \(\mathcal {D}\)-descriptive of a language S generates a language that is a generalization of S that is ⊆-minimal within languages described by elements of \(\mathcal {D}\). If the class \(\mathcal {D}\) is clear from the context, we simply write descriptive instead of \(\mathcal {D}\)-descriptive.

As stated in [11] (using different terminology), for every finite language S, S O A(S) is Soa-descriptive of S. This extends to infinite languages as well; for Sores and Chares, we can also prove the existence of descriptive regular expressions. Note that this proof is non-constructive; in later sections we will be concerned with efficiently finding descriptive Chares and Sores.

Proposition 15

Let Σ be a finite alphabet. For every language L ⊆ Σ, SOA(L) is Soa -descriptive of L, and there exist a Sore-descriptive Sore δ s and a Chare -descriptive Chare δ c .

Proof

We begin with the claim for Soas. First, note that every edge in S O A(L) corresponds to a first letter, a last letter, or a digram of a word in L. Hence, these edges must occur in every Soa A with \(\mathcal {L}(A)\supseteq L\). By Lemma 10, this means that for every such Soa A, \(\mathcal {L}(A)\supseteq \mathcal {L}(SOA(L)) \supseteq L\). In particular, there is no Soa A with \(\mathcal {L}(SOA(L))\supset \mathcal {L}(A) \supseteq L\), as this would imply \(\mathcal {L}(SOA(L))\supset \mathcal {L}(A) \supseteq \mathcal {L}(SOA(L))\). Therefore, S O A(L) is descriptive of L.

Regarding the second claim, let \(\mathcal {D}\in \{\textsc {Chare},\textsc {Sore}\}\) and assume that there is a language L over some finite alphabet Σ such that no expression \(\alpha \in \mathcal {D}\) is \(\mathcal {D}\)-descriptive of L. This implies that there is an infinite sequence (β i ) i ≥ 0 of expressions from \(\mathcal {D}\) with α = β 0, and \(\mathcal {L}(\beta _{i})\supset \mathcal {L}(\beta _{i+1})\supseteq L\) for all i ≥ 0. This contradicts the fact that there is only a finite number of non-equivalent Sores (and, hence, Chares) over Σ. Hence, for every language L, a Chare-descriptive Chare and a Sore-descriptive Sore must exist. □

An immediate consequence of Proposition 15 is the following observation.

Corollary 16

For every language L, \(\mathcal {L}(SOA(L))=L\) iff L is a Soa language.

More importantly, Proposition 15 implies that the algorithm 2T-INF from [4] that was mentioned in the previous section can be used to compute Soa-descriptive Soas for finite sample sets. Moreover, together with Corollary 7, this shows that constructing a descriptive Soa for an arbitrary language L is as hard as computing the sets first (L), last (L), and gram2(L).

As we shall see, computing descriptive Sores or Chares is less straightforward. First, note that the first part of the proof of Proposition 15 implies the following observation.

Corollary 17

Let Σ be a finite alphabet, and let L ⊆ Σ. For every Sore (or Chare) δ that is Sore -descriptive (resp. Chare-descriptive) of L, \(\mathcal {L}(\delta )\supseteq \mathcal {L}(SOA(L))\) holds.

Hence, if some Sore (or Chare) is descriptive of a language L, it must be descriptive of \(\mathcal {L}(\text {SOA}(L))\) as well. This allows us to compute descriptive Sores and Chares not from a sample L, but from its Soa approximation SOA(L).

Furthermore, if \(\mathcal {L}(\text {SOA}(L))\) is not a Sore language (or not a Chare language), a Soa for some Sore that is descriptive of L can be obtained as follows: iterate through all sets of missing edges in a ⊆-increasing way; for each set, check whether adding these edges turns the Soa into accepting a Sore language. The main question is whether this approach is efficient: as it can be necessary to add a substantial number of new edges in order to turn a Soa into a Soa that corresponds to a descriptive expression (see Proposition 18 just below), such a brute force approach is probably not advisable.

The next proposition lists these and other numbers about counting and descriptive Sores and Chares; it is also of independent interest in order to understand the classes of Sores and Chares better. From [3, Proof of Theorem 3.1] we know that any Sore language has a Sore of length at most 10n−4, which gives a bound of 2O(n logn) on the number of different Sore languages. Our results are summarized in Table 1. Recall that regular expressions are called equivalent if they generate the same language.

Table 1 A summary of the numbers presented in Proposition 18

Proposition 18

Let Σ be a finite alphabet of n alphabet symbols. We have the following, for some constant r.

  1. (1)

    The number of pairwise non-equivalent Chare s is c(n) with n! 22nc(n) ≤ n! 23n.

  2. (2)

    The number of pairwise non-equivalent Sore s is s(n) with n! 23n−r logns(n) ≤ n! 27n.

  3. (3)

    There is a sample S ⊆ Σ such that S has 2n pairwise non-equivalent descriptive Sore s.

  4. (4)

    There is a sample S ⊆ Σ such that S has n! pairwise non-equivalent descriptive Chares.

  5. (5)

    There is a Soa with Θ(n) edges such that a descriptive Sore with a minimal number of edges in the corresponding Soa has Θ(n 2) edges.

  6. (6)

    There is a Soa with Θ(n) edges such that a descriptive Chare with a minimal number of edges in the corresponding Soa has Θ(n 2) edges.

Proof

We start with showing (1). We have n! 22nc(n), as any sequence of all and only the elements from \(\{a_{i}\,|\,i \in \mathbb {N}\}\), each with one of the four possibilities of adding + (for repetition) or ? (for optional), give non-equivalent Chares. On the other hand, we can bound the number of syntactically different Chares as follows. There are n! different choices for the order of the terminal symbols. For a given order of the terminal letters, we associate any binary string x of length n−1 with the Chare that uses the given order of terminal letters, and, for all in, adds a new chain factor for the i + 1th letter iff x at place i has a 1 (and use the same chain factor otherwise). For a Chare with k chain factors, there are 4k different choices for the annotation of the chain factors (none, ?,+, or+?). Thus, there are at most

$$\sum\limits_{k=1}^{n-1} n! \; 4^{k} \binom{n-1}{k} \leq n! \; 4^{n} \sum\limits_{k=0}^{n} \binom{n}{k} = n! \; 2^{3n} $$

possibilities for syntactically distinct Chares, which gives the upper bound.

We now discuss (2). We view a Sore as a binary tree with any internal node labeled by “ | ” (for the disjunction) or “ ⋅” (for concatenation), and all leaves labeled by distinct alphabet symbols; furthermore, any node (including the leaves) can be additionally labeled by “+” (for repetition) and/or “?” (for optional). We use the intuitive translation of such trees as syntax trees of Sores; clearly, any Sore on n alphabet symbols can be equivalently written as such a tree. The number of different binary trees with k leaves is known as the sequence of Catalan numbers and is asymptotically 22k−Θ(logk). A much more precise bound is known; however, this bound will suffice for our rough estimates. With two different choices for the label of all internal nodes and four different choices for each node for the “+” and/or “?” label, we get a maximal number of

$$n! \; 2^{2n - {\Theta}(\log n)} 2^{n} 4^{2n} \leq n! \; 2^{7n} $$

many different non-equivalent Sores. Regarding the lower bound, consider now all such syntax tree with all internal nodes labeled “ ⋅” and “+”, and all leaves possibly labeled by “+”. It is easy to check that this corresponds to pairwise non-equivalent Sores. This gives a lower bound of

$$n! \; 2^{2n - {\Theta}(\log n)} 2^{n} = n! \; 2^{3n - {\Theta}(\log n)}. $$

Regarding (3), we note that the sample S = {a B 1 a B 2 a B 3 a B 4a B n a} has 2n pairwise non-equivalent descriptive Sores as follows. For any partition of {B i | 1 ≤ in} into two disjoint (but possibly empty) sets B 1 and B 2, we have that (A L T(B 1)?a A L T(B 2)?)+ is a Sore-descriptive of S (recall that A L T(∅) = ε). We see that these Sores are descriptive by applying either the Sore construction algorithm from [4] (which finds a Sore equivalent to a given Soa, if existent) or ours from Section 5 and observing a strict generalization.

Regarding (4), we note that, for all n ≥ 2, the sample \(\{{a_{i}^{2}} \mid 1\leq i \leq n\}\) has n! pairwise non-equivalent descriptive Chares (namely all Chares of the form A p(1)?…A p(n)?, where p is a permutation of {1,…, n}).

Regarding (5), suppose n ≥ 1 and consider the sample S = {A i A i + 1∣1 ≤ i<n}∪{A 1, A n }. The Soa corresponding to S has Θ(n) edges. Let α be a Sore-descriptive of S; we will show that, for all i, j with i<j<n, we have A i α A j . Let x be the least common ancestor of A i and A j in the syntax tree of α. As there is a word in \(\mathcal {L}(\alpha )\) which starts with A j , the child of x which contains A j generates words starting with A j symmetrically, the child of x which contains A i generates a word ending with A i . Thus, we are done if x is labeled by concatenation and A i is in the left child of x (and A j in the right). It is straightforward to see that the other cases will similarly necessitate A i α A j , by using \(a_{i} {\rightarrow ^{+}_{S}} a_{j}\). This also implies that A 1?…A n ? is the only Sore that is descriptive of S (modulo equivalence; as equivalent Sores like (A 1?…A n ?)? are also descriptive of S).

We have that (6) follows from the proof of (5), as the only Sore-descriptive Sore for the sample in that case is also a Chare. □

In particular, note that Proposition 18 also demonstrates that a given sample can have numerous different descriptive Sores (or Chares). Note that the number of different Chare- and Sore languages can be better approximated using more advanced tools from combinatorics (including a more precise counting, for example with the inclusion-exclusion principle, and the use of sophisticated bounds for known number sequences, such as the Bell numbers). Finally, if we are only interested in the number of different such languages modulo renaming of the terminal letters, then the same bounds without the factor n! hold.

3 Descriptivity Versus CRX and RWR

Proposition 18 demonstrates that the number of non-equivalent descriptive Sores (or Chares) for a sample can be exponential in the size of the alphabet. Therefore, the present paper only examines the question how a single descriptive Sore (or Chare) can be found for a sample, instead of looking for an enumeration of all these expressions.

As explained in Section 2.2 (in particular, Corollary 17), descriptive Chares and Sores can be obtained from the descriptive Soa, and moreover, for every language L and every Sore α, \(\mathcal {L}(\alpha )\supseteq \mathcal {L}(\text {SOA}(L))\) must hold. This observation motivates our inference approach for Sores and Chares: Given a sample S, first compute the Soa-descriptive single-occurrence automaton SOA (S), using 2T-INF. As explained in [11], this can be done in time O(l n), where \(l:= {\sum }_{s\in S}|s|\), and n:=|term(S)|.

Using the algorithm Soa2Chare (Section 4) or Soa2Sore (Section 5), SOA(S) is then turned into a descriptive Chare or Sore (respectively). Before we discuss these algorithms and the respective proofs in detail, we observe that the algorithms CRX and RWR and their variants from [4] do not always compute descriptive Chares or Sores.

For the Chare algorithm CRX, this is quite easy to see: As pointed out in [4] (as a remark after Theorem 35), on the sample S = {a b c, a d e, a b e}, the algorithm CRX returns the Chare a?b?c?d?e?, while δ: = a(b | d)(c | e) is a better approximation of S. In fact, we shall be able to see that δ is not only better, but Chare-descriptive. This can be verified by observing that δ is the output of Soa2Chare on SOA(S), and referring to Theorem 21 further down.

The proofs for the non-descriptivity of the Sore algorithm RWR and its variants require more effort; we proceed by giving a description of RWR, followed by a proof of their non-desriptivity in Section 3.2.

3.1 Description of RWR

In this section we describe the algorithm RWR from [4] in some detail. For a more formal definition of RWR and the rules it uses, as well as any variants, we refer to [4]. This algorithm takes a Soa A as input and, by step-by-step modifications, turns it into smaller and smaller generalized Soas, until a generalizing Sore can be read off the only remaining vertex (apart from source and sink).

The modifications concern either one or two states of the Soa; two types of rules are distinguished: rewrite rules and repair rules. First, rewrite rules are applied, as long as any are possible. From [4] we know that these rules will turn any Soa into an equivalent Sore, if possible. If, at some point, none of the rules are applicable, then we know that the input Soa is not equivalent to a Sore, and the repair rules are used to generalize (which will then enable rewrite rules to continue). Let A be a Soa used as input to RWR. The rewrite rules concerning only one state are as follows.

  • Iteration r +: If there is a vertex labeled r with a self-loop, remove the self-loop and change the label of the vertex to r +.

  • Optional r?: If there is a vertex labeled r with the source as predecessor, the sink as a successor, and there is an edge from source to sink, remove the edge from source to sink and change the label of the vertex to r?.

The rewrite rules concerning two states have similar syntactic criteria. For clarity, we give semantic criteria only. Let r and s be labels of two vertices of A. We try contracting these two vertices (in a graph-theoretic sense, i.e., with appropriate edge modifications) into a vertex labeled one of

$$r \, | \, s, \;\; r \cdot s, \;\; r? \cdot s, \;\; r \cdot s?, \;\; r? \cdot s?$$

and possibly with a self loop. If one of these leads to a generalized Soa accepting the same language, then we keep the contraction, and continue from this generalized Soa. In [4] equivalent rules are stated, but with syntactic criteria which are easy to check (but tedious to state).

When no more rewrite rules are applicable, then either only one state apart from source and sink remains (which allows us to read off the Sore), or the original Soa was not equivalent to a Sore and we need to generalize. RWR uses the following (syntactic) repair rules which generalize the Soa.

  • Repair r | s: If there are two vertices r and s of A which share a successor or a predecessor, add edges to A to make all successors of r or s successors of both r and s; similarly with the predecessors. If there is an edge between r and s, add an edge in the opposite direction as well; add self loops on both r and s unless where the label of the vertex has a “+” at the root of the syntax tree.

  • Repair rs?: If there are two vertices r and s of A such that r is the only predecessor of s, add edges to A to make all successors of r or s (except s) successors of both r and s; add a self-loop on r only if the label of r does not have a “+” at the root of its syntax tree.

  • Repair r?⋅s: If there are two vertices r and s of A such that s is the only successor of r, add edges to A to make all predecessors of r or s (except r) predecessors of both r and s; add a self-loop on s only if the label of s does not have a “+” at the root of its syntax tree.

  • Repair r?⋅s?: Let r and s be vertices of A such that s is a successor of r; add edges to A to make all successors of r or s successors of both r and s; similarly with the predecessors; for both r and s, introduce self-loops only on vertices which do not have a label with “+” at the root of its syntax tree. Furthermore, for all predecessors u of r and all successors v of s, add an edge from u to v.

The repair rules are applied only if none of the rewrite rules is applicable, and the first applicable rule from the list above is applied. After each application of a repair rule, rewrite rules are used again as long as possible. This terminates with a generalized Soa with only a single state different from source and sink; its label is the output of RWR.

3.2 RWR-Variants and Descriptivity

In this section we give theorems regarding properties of RWR-variants. In particular, we show that every variant fails to find a descriptive Sore on some input. First, we formally show that RWR does not always return a descriptive Sore.

Theorem 19

For Σ a finite alphabet with | Σ | ≥ 3 and all orderings of the repair rules of RWR, there is a (finite) set of samples S ⊆ Σ such that RWR on S produces a Sore which is not Sore-descriptive.

Proof

Let a, b, c ∈ Σ be three different symbols from Σ. First, consider the sample {a b a, a b}. The corresponding Soa does not allow rewrite rules and requires repair; below this Soa is depicted, along with two possible repairs, corresponding to the two possible repairs “Repair a | b” and “Repair a?⋅b?”.

figure c

The Soas resulting from the two repairs accept (a | b)+ and (a | b), respectively, which is not descriptive of {a b a, a b}, as witnessed by δ 1 := (a(b?))+ (a Sore which accepts the given sample aba and ab, but not, for example, b, which is accepted by any of the Soas derived from repair rules above).

Second, consider the sample S = {a b, a c, a c a c}. The corresponding Soa A is depicted as follows.

figure d

A descriptive Sore for S is δ 2:=(a(b | c))+, which we prove as follows. In comparison to A, the Soa that corresponds to δ 2 adds only a single edge, the edge from a to b. So the only possibility for a Sore language \(\mathcal {L}(\gamma )\) with \(\mathcal {L}(A)\subseteq \mathcal {L}(\gamma )\subset \mathcal {L}(\delta _{2})\) is \(\mathcal {L}(A)\) itself. However, \(\mathcal {L}(A)\) is not a Sore language, which can be seen, just as in Proposition 18, by applying either the Sore construction algorithm RWR from [4] or our algorithm Soa2Sore from Section 5 (which both compute a Sore equivalent to a given Soa, if existent) and observing a strict generalization. Hence, δ 2 is Sore-descriptive of S. (We note without proof that (a c?)+ b? is another Sore that is descriptive of S. Necessarily, its language is incomparable to \(\mathcal {L}(\delta _{2})\).)

An application of “Repair ab?” on A and then, after rewriting, of “Repair [a b?]⋅c?” gives the following.

figure e

This Soa corresponds to the Sore (a b?c?)+, and its language is a strict superset of \(\mathcal {L}(\delta _{2})\) for the (descriptive) Sore δ 2 = (a(b|c))+ (for example abc is generated by the former and not the latter). Deceiving the rule “Repair r?⋅s” is symmetric to deceiving “Repair rs?”. □

In [4], Bex et al. propose a variant of RWR that is called RWR \(^{2}_{\ell }\), which uses a natural number as a branching parameter. The algorithm explores the (recursive) outcomes of the best candidates for a repair rule, choosing the ones that lead to a minimal number of words of length at most 2n (=2|Σ|) in the language generated by the resulting Sore.

Theorem 20

For all ℓ>0 there is a finite alphabet Σ with |Σ | =3ℓ and a finite set of samples S ⊆ Σ such that RWR \(^{2}_{\ell }\) on S produces a Sore which is not Sore -descriptive.

Proof

We first assume = 1; consider again the sample {a b, a c, a c a c} with the following corresponding Soa.

figure f

The three applicable repair rules are b | c, ab? and ac? (plus some rules of the type “ r?⋅s?”, which explode the number of accepted words). This leads to the following Soas.

figure g

In order to determine which of the rules RWR will choose, we analyse the properties of these three Soas, which we summarise in Table 2. Thus, we see that second possibility accepts a minimal number of words of length at most 6 (=2|Σ|), which means that only this option will be explored, the first and the third wi ll be discarded. After rewriting by RWR, this results in the following Soa.

figure h
Table 2 Properties of the languages discussed in the proof of Theorem 20

The minimal repair for this results in (a b?c?)+, which is not descriptive as witnessed by (a(b | c))+ as in the proof of Theorem 19.

For >1, we use independent copies of the sample used for = 1 (i.e., using different alphabet symbols). Thus, RWR \(^{2}_{\ell }\) will fail on at least one of these copies. □

4 Descriptive CHAREs

In this section, we give the first main algorithm of this paper, Soa2Chare, which efficiently computes descriptive Chares from given Soas.

4.1 The CHARE Algorithm

The algorithm Soa2Chare uses a number of subroutines, which are written with a dot-notation similar to some modern object oriented programming languages. For example “A.contract (U, )” denotes the application of the subroutine “contract” to the Soa A with parameters U and . For a given Soa A, we let A.src and A.snk denote the source and the sink of A, respectively. The following subroutines are used in Soa2Chare.

  • “contract” on Soa A takes a subset U of vertices of A and a label . The procedure modifies A such that all vertices of U are contracted to a single vertex and labeled (edges are moved accordingly).

  • “constructLevelOrder” on Soa A = (V, E) assumes that A is acyclic and assigns a level number to every vertex vV, where the level number of a vertex vV is defined to be the length of the longest path from A.src to v. Hence, A.src is on level number 0, and for every other vertex v, the level number is one more than the highest level number of the immediate successors of v.

  • “isSkipLevel” on Soa A and a level number i returns true if level i is a skip level. A level i is a skip level if there exist vertices u, vV with (respective) level numbers j u <i and j v >i such that u A v. In other words, one can skip level i by transitioning from u to v.

figure gi

Note that the use of “contract” can turn the Soa into a generalized Soa. Intuitively speaking, the algorithm Soa2Chare works as follows:

  1. (1)

    Replace each strongly connected looped component AV with a vertex that is labeled with the regular expression A L T(A)+. This turns A into a (possibly generalized) Soa that is a DAG.

  2. (2)

    Every vertex in the DAG is assigned a level number.

  3. (3)

    Every level is turned into one or more chain factors. If a level contains more than one non-letter vertex, or if a level is a skip level, ? is appended to every chain factor on that level.

The following theorem states that Soa2Chare can be used to compute Chare-descriptive Chares in an efficient manner.

Theorem 21

For any given Soa A, Soa2Chare finds a Chare that is Chare -descriptive of \(\mathcal {L}(A)\) in time O(m), where m is the number of transitions of A.

Before we discuss the proof of Theorem 21 further down, we illustrate the behavior of Soa2Chare with an Example.

Example 22

Let S = {a b a f, a b e f, c c d f}. The corresponding Soa, SOA(S), is depicted as follows.

figure j

First, Soa2Chare removes all cycles by contracting strongly connected looped components. This leads to the following generalized Soa.

figure k

Apart from the levels for A.src and A.snk, this generalized Soa has three levels: The first level with the vertices (a | b)+ and (c)+, the second level with the vertices d and e, and the third level with the vertex f. As there is an edge between (a | b)+ and f, the second level is a skip level. Thus, the levels lead to the respective Chares (a | b)+?(c)+?, (e | d)?, and f, which are concatenated to (a | b)+?(c)+?(e | d)?f. By Theorem 21, this Chare is Chare-descriptive of S.

Proof of Theorem 21.

We first prove termination and running time, followed by the proof of correctness.

Termination and running time

Termination is obvious, as the two loops (in lines 2 and 7) are executed only a bounded number of times.

Let n denote the number of vertices and m denote the number of edges in the input Soa. In the while-loop in line 2, the input Soa is transformed into an acyclic generalized Soa. Using Tarjan’s algorithm for finding strongly connected components (cf. [6, Section 22.5]), this part can be realized in time O(m + n).

Computing the level order and annotating, for each level, whether that level is a skip level, can also be done in time O(m + n), analogously to a topological sorting.

Finally, each vertex in the generalized Soa is turned into a chain factor. This takes time O(n). Hence, the individual steps sum up to a time of O(m + n), which results in a total time of O(m), as nm holds by definition.

Correctness

First, it is quite easy to see that Soa2Chare computes a Chare. Note that, in order to prove that this Chare is descriptive of the sample S, we do not need to argue about every Chare γ with \(\mathcal {L}(\gamma )\supseteq S\), but only about those with \(\mathcal {L}\)(Soa2Chare(SOA(S))) \(\supseteq \mathcal {L}(\gamma )\supseteq S\).

This allows us to use Lemma 11 from two directions: On the one hand, every edge (and hence, every path) that is present in SOA(S) must be present in SOA(γ), on the other hand, SOA(γ) must not contain any edges that do not occur in SOA(δ).

Before we consider the main part of the proof, we first develop some technical tools that deal with strongly connected looped components.

Lemma 23

Let α be a Chare. A set Aterm(α) is a strongly connected looped component in SOA(α) if and only if α contains a chain factor of the form ALT(A) + or ALT(A) +?.

Proof

We begin with the if direction. Assume that some Chare α contains a chain factor α with α = A L T(A)+ or α = A L T(A)+? for some finite and non-empty set A⊆Σ. By definition, a α b holds for all letters a, bA. Furthermore, let α = α 1 α α 2 for appropriate (possibly empty) Chares α 1, α 2. For every vertex vt e r m(α 1)∪ { A.src} and every aA, we observe \(v{\rightarrow ^{+}_{\alpha }}a\), but not \(a{\rightarrow ^{+}_{\alpha }}v\). Likewise, for every vertex vt e r m(α 2)∪ { A.snk} and every aA, we observe \(a{\rightarrow ^{+}_{\alpha }}v\), but not \(v{\rightarrow ^{+}_{\alpha }}a\). Hence, A is a strongly connected looped component in SOA(α).

For the only if direction, we assume that there is a (non-empty) set At e r m(α) that is a strongly connected looped component in SOA(A). Now, let α = α 1α n for some n, where each α i is a chain factor. As A is a strongly connected looped component, \(a{\rightarrow ^{+}_{\alpha }}b\) holds for all a, bA. Therefore, there is an k with A k : = t e r m(α k )⊇A (otherwise, there would exist a, bA, at e r m(α i ), bt e r m(α j ), i<j, which would imply \(a{\rightarrow ^{+}_{\alpha }}b\), but not \(b{\rightarrow ^{+}_{\alpha }}a\)). Moreover, as \(a{\rightarrow ^{+}_{\alpha }}a\) holds for all aA, α k is equal to A L T(A k )+ or A L T(A k )+?. Finally, if A k A, there would exist an bA k A with b α a α b for all aA. As bA, this would imply that A is not a strongly connected looped component, and contradict our initial assumption. Hence, A k = A, and α k ∈ {A L T(A)+, A L T(A)+?}. □

As Soa2Chare turns every strongly connected looped component A into a chain factor A L T(A), we observe that Soa2Chare does not change these components.

Corollary 24

Let Σ be an alphabet. For every finite and non-empty set S ⊆ Σ, and every set Aterm(S), the following holds. A is a strongly connected looped component in SOA(S) if and only if A is a strongly connected looped component in SOA(Soa2Chare(SOA (S)).

Finally, according to Lemma 11, this immediately leads to the following observation:

Corollary 25

Let S ⊆ Σ be a finite set, and let δ:=Soa2Chare(SOA (S)). For every Chare γ with \(\mathcal {L}(\delta )\supseteq \mathcal {L}(\gamma )\supseteq S\), SOA(γ) must contain exactly the same strongly connected looped components as SOA(S) and SOA(δ).

We now posses all the tools we need to execute the main element of the proof of correctness of Soa2Chare.

Lemma 26

Let Σ be an alphabet, let S ⊆ Σ be a non-empty set, and let δ := Soa2Chare(SOA (S)). Then \(\mathcal {L}(\delta )=\mathcal {L}(\gamma )\) holds for every Chare γ with \(\mathcal {L}(\delta )\supseteq \mathcal {L}(\gamma )\supseteq S\).

Proof of Lemma 26. We shall prove Lemma 26 by proving a stronger claim, namely that δ = γ holds. This claim only holds if we allow for a slight abuse of the = symbol; as for the remainder of this proof, we shall interpret α = β to mean that α and β are identical modulo reordering of the terminals symbols inside the chain factors (i. e., (a | b) = (b | a) holds, but (a)(b)≠(b)(a)).

Before we proceed to the actual proof, we begin with a preliminary observation. In line 5, the algorithm Soa2Chare partitions the vertices of S O A(S) into levels 0 to n (with n ≥ 1). Note that with the exception of levels 0 and n (which contain only A.src and A.snk, respectively), each level i leads to a sub-Chare δ i , which consists of one or more chain factors. Hence, Soa2Chare implicitly defines a factorization δ = δ 1δ n−1, where each δ i was derived from level i. Note that δ = ε holds for the special case of n = 1.

We prove Lemma 26 using induction on this n, i. e., the level number assigned to A.snk (which can also be understood as the number of levels after A.src, and n−1 is the number of factors in the factorization δ 1δ n−1 as given above). More specifically, we prove the following claim for every n ≥ 1.

Claim 1

. Let Σ be an alphabet, let S⊆Σ be a non-empty set for which the level construction step of Soa2Chare creates levels 0 to n, and let δ:=Soa2Chare(SOA(S)). Then δ = γ holds for every Chare γ with \(\mathcal {L}(\delta )\supseteq \mathcal {L}(\gamma )\supseteq S\).

Base case.

For n = 1, S O A(S) contains only the vertices A.src and A.snk, and S = {ε} must hold. As δ = ε and \(\mathcal {L}(\delta )=S\), \(\mathcal {L}(\delta )\supseteq \mathcal {L}(\gamma ) \supseteq S\) implies γ = ε = δ for every Chare γ.

Inductive step.

Now assume that Claim 1 holds for some n ≥ 1. Let S be a set for which the level construction step of Soa2Chare creates levels 0 to n + 1, let δ:=Soa2Chare(SOA(S)) with δ = δ 1δ n , where each δ i was derived from level i, and let γ be a Chare such that \(\mathcal {L}(\delta )\supseteq \mathcal {L}(\gamma )\supseteq S\) with γ = γ 1γ m (m ≥ 1), where each γ i is a chain factor. W. l. o. g. t e r m(δ) = t e r m(γ) = t e r m(S) = Σ.

We define the Chare δ : = δ 1δ n−1 (with δ : = ε if n = 1), and the set \(S^{\prime }:= {\pi _{term(\delta ^{\prime })}}(S)\). In other words, δ is obtained by removing level n from the level construction for S; thus, δ =Soa2Chare(SOA (S )) holds by definition. The proof is based on the following claim.

Claim 2

. There exists some i with 1 ≤ im such that δ n = γ i γ m .

Proof of Claim 2.

When building δ n , Soa2Chare constructs the sets B and C for level n, where B contains all vertices on level n that are labeled with+ (and, hence, represent some strongly connected looped component that was contracted in lines 2–4), while C contains all vertices with level number n that represent single letters. Note that at most one of the sets B and C may be empty. Hence, exactly one of the following cases holds:

  1. (1)

    B = ∅, C ≠ ∅, level n is not a skip level,

  2. (2)

    B = ∅, C ≠ ∅, level n is a skip level,

  3. (3)

    |B|=1, C = ∅, level n is not a skip level,

  4. (4)

    |B|=1, C = ∅, level n is a skip level,

  5. (5)

    |B|≥2, C = ∅,

  6. (6)

    B ≠ ∅, C ≠ ∅.

Case

(1): If B = ∅ and C ≠ ∅, and level n is not a skip level, δ n = A L T(B) holds. Then level n of the construction contains exactly the vertices in C, and for all vertices v ∈ Σ∪ {snk} and all cC, c s v if and only if v = snk. Furthermore, as level n is not a skip level, there is no vertex v ∈ (Σ∪ {src}) ∖C with v s snk. By definition, these observations still hold if → s is replaced with → δ .

According to Lemma 11, S O A(δ) covers S O A(γ), which in turn covers S O A(S). Hence, for all vertices v ∈ (Σ∪ {src}), v γ snk holds if and only if vC. Therefore, t e r m(γ m ) = C and \({\varepsilon }\notin \mathcal {L}(\gamma _{m})\) must hold. This is only satisfied if γ m = A L T(C) or γ m = A L T(C)+. We can exclude the latter case, as C is not a strongly connected looped component in S O A(S) or S O A(δ), and Corollary 25 applies. Hence, γ m = A L T(C) = δ n follows.

Case

(2): If B = ∅ and C ≠ ∅, and level n is a skip level, δ n = A L T(C)?. As in Case (1), we can observe that for all vertices v ∈ Σ∪ {snk} and all cC, c s v (and c δ v) if and only if v = snk. But as n is a skip level, this only allows us to conclude that all elements of C must be placed in the last chain factor of γ; i. e., Ct e r m(γ m ). Furthermore, as the elements of C do not belong to any strongly connected looped component, Corollary 25 yields that γ m A L T(t e r m(γ m ))+ and γ m A L T(t e r m(γ m ))+?.

In order to prove t e r m(γ m ) = C, assume that Ct e r m(γ m ), i. e., there exists a letter at e r m(γ m )∖C. (Note that this is only possible if n ≥ 2, otherwise, we can conclude that t e r m(γ m ) = C and skip to the next paragraph.) As γ m cannot contain+ or+?, we observe that for all vertices v ∈ Σ∪ {snk}, a γ v holds iff. v = snk. Therefore, there can be no cC with \(a {\rightarrow ^{+}_{\gamma }} c\), and as SOA(γ) covers SOA(S), there is also no cC for which \(a {\rightarrow ^{+}_{S}} c\) holds. In other words, all paths from src to vertices of C must lead through other vertices than a; hence, there is a vertex v on level n−1≥1 with v s c for some cC. Hence, we have v γ a, but not v δ a, which contradicts \(\mathcal {L}(\delta )\supseteq \mathcal {L}(\gamma )\). We conclude term(γ m ) = C.

We now know that γ m ∈ {A L T(C), A L T(C)?}. As level n is a skip level, there exists a vertex v ∈ {src} ∪(Σ∖C) with v s snk. Hence, A L T(term(C))? = δ n must hold.

Case

(3): If |B|=1, C = ∅, and level n is not a skip level, then δ n = A L T(B m )+ for some set B m ⊆ Σ with B = {A L T(B m )+}. By definition of Soa2Chare, this is only possible if B m is a strongly connected looped component in S O A(S), and for all vertices v ∈ Σ∪ {src}, v s snk holds only if vB m .

If term (γ m )≠B m , then S O A(γ) does not contain the same strongly connected looped component as S O A(S) and S O A(δ), or there is some letter cB m with c γ snk. In either case, there is a contradiction to \(\mathcal {L}(\gamma )\supseteq S\) or \(\mathcal {L}(\gamma )\subseteq \mathcal {L}(\delta )\) and Corollary 25. Hence, term (γ m ) = B m must hold.

Furthermore, if γ m ∈ {A L T(B m ), A L T(B m )?}, B m is also not a strongly connected looped component in SOA(γ) (a contradiction to \(\mathcal {L}(\gamma )\supseteq S\)). Hence, either γ m = A L T(B m )+ or γ m = A L T(B m )+? holds. Assume for the sake of the argument that \({\varepsilon }\in \mathcal {L}(\gamma _{m})\). Then there exists a vertex vt e r m(γ 1γ m−1)∪ {snk} with v γ snk, but v δ δ does not hold (otherwise, n would be a skip level). We conclude γ m = A L T(B m )+ = δ n .

Case

(4): If |B|=1, C = ∅, and level n is a skip level, then δ n = A L T(B m )+? for some set B m ⊆ Σ with B = {A L T(B m )+}. As in Case (3), we are able to derive that either γ m = A L T(B m )+ or γ m = A L T(B m )+?. Taking the skip level into account as in Case (2), we conclude γ m = A L T(B m )+? = δ n .

Case

(5): If |B|≥2 and C = ∅, there exist a k ≥ 2 and k disjoint non-empty sets B mk + 1,…, B m ⊆ Σ with B = {A L T(B mk + 1)+,…, A L T(B m )+}, and δ n = A L T(B mk + 1)+?…A L T(B m )+?. In order to increase readability, let i:=(mk + 1).

According to the definition of Soa2Chare, every B j is a strongly connected looped component in S O A(S), and each B j was placed in level n of the construction. Therefore, due to Corollary 25, for every j with ijm, γ contains a chain factor A L T(B j )+ or A L T(B j )+?. Hence, for all j with ijm, t e r m(γ j ) = B j and γ j ∈ {A L T(B j )+, A L T(B j )+?} must hold – otherwise, S O A(γ) would contain edges to or from the letters of B j that do not occur in S O A(δ), a contradiction to Lemma 11.

Finally, as all vertices in B are on level n of the level construction, there exist vertices u ∈ (Σ∪ {src}) \(\setminus (\bigcup _{j=i}^{m} B_{j})\) and bB m with u s b as well as a vertex b B i with b s snk. In order to ensure these reachabilities in γ, each chain factor γ j must be able to generate ε. Hence, γ j = A L T(B j )+? holds for all j (ijm), and we conclude δ n = γ i γ m .

Case

(6): If B ≠ ∅ and C ≠ ∅, there exist a k ≥ 1 and sets B mk ,…, B m−1 with B = {A L T(B mk )+,…, A L T(B m−1)+}, and

$$\delta_{n}=ALT(B_{m-k})^{+}?{\ldots} ALT(B_{m-1})^{+}? ALT(C)?.$$

Using reasoning that is analogous to Case (5), we are able to conclude that γ j = A L T(B j )+? for all j with (mk) ≤ j≤(m−1).

All that remains to do is proving γ m = A L T(C)?. Once again according to the reasoning we used in all previous cases, term (γ m ) = C must hold, as otherwise, we would introduce edges not present in SOA(δ), or lose edges present in SOA(S). Due to Corollary 25, we know that γ m ∉ {A L T(C)+, A L T(C)+?}; and due to the same reachability argument as for the γ j in Case (5), \({\varepsilon }\in \mathcal {L}(\gamma _{m})\) must hold. We conclude γ m = A L T(C)? and, hence, δ n = γ mk γ m .

As we now know that there is an i with 1 ≤ im such that δ n = γ i γ m , we can combine δ = δ δ n and \(\mathcal {L}(\delta )\supseteq \mathcal {L}(\gamma )=\mathcal {L}(\gamma _{1} {\ldots } \gamma _{m})\) to

$$\mathcal{L}(\delta^{\prime}\delta_{n})\supseteq \mathcal{L}(\gamma_{1} {\ldots} \gamma_{i-1}\delta_{n}).$$

By splitting off δ n , we conclude

$$\mathcal{L}(\delta^{\prime})\supseteq \mathcal{L}(\gamma_{1} {\ldots} \gamma_{i-1}).$$

Due to our induction assumption, this implies δ = γ 1γ i−1. Hence, δ = δ δ n = γ 1γ m holds, which completes the proof.

As Claim 1 holds for all n ≥ 1, \(\mathcal {L}(\delta )\supseteq \mathcal {L}(\gamma )\supseteq S\) implies δ = γ (with the caveat that terminals inside chain factors of γ and δ might have a different order). Hence, \(\mathcal {L}(\delta )=\mathcal {L}(\gamma )\), and Lemma 26 follows immediately.

According to Lemma 26, there is no Chare γ for which \(\mathcal {L}\)(Soa2Chare(SOA(S))) \(\supset \mathcal {L}(\gamma )\supseteq S\) holds. As we have, by definition, \(\mathcal {L}\)(Soa2Chare(SOA(S))) ⊇S, we know that the result of Soa2Chare on SOA(S) is Chare-descriptive of S, which concludes the proof of correctness.

5 Descriptive SOREs

In this section, we give the second main algorithm of this paper, which efficiently computes descriptive Sores from given Soas.

5.1 SORE Algorithm

As in Section 4, we use dot-notation to denote the application of subroutines. Likewise, for a given Soa A, we let A.src and A.snk denote the source and the sink of A, respectively. We let V be the set of vertices in A and E the set of edges.

  • For any vertex vV, we let A.pred(v) denote the set of all predecessors of v in A; similarly, A.≻(v) denotes the set of all successors.

  • For any vertex vV, we let \(A.\text {reach}(v) := \{u \in V \mid v {\rightarrow ^{*}_{A}} u\}\) be the set of all vertices reachable from v.

  • “contract” on Soa A takes a subset U of vertices of A and a label . The procedure modifies A as follows. All vertices in U are removed; a new vertex labeled is added; for each edge (v, u)∈E with vVU and uU, we remove (v, u) and add an edge (v, ); similarly, for each edge (u, v)∈E with vVU and uU, we remove (u, v) and replace with an edge (, v). We call this method whenever we identify a subset of vertices for which we can compute a descriptive generalization (the label is then this generalizing Sore).

  • “extract” on Soa A takes as argument a set of vertices U (of A); it does not modify A, but returns a new Soa with copies of all vertices of U as well as two new vertices for source and sink; all edges between vertices of U are copied, all vertices in U having an incoming edge (in A) from outside of U have now an incoming edge from the new source, and all vertices in U having an outgoing edge (in A) to outside of U have now an outgoing edge to the new sink. We use this method whenever we identified a subpart of the Soa to recurse on.

  • “first” returns all vertices v such that the only predecessor of v is the source. These are particularly interesting, since our algorithm will work on the Soa by starting from the source and progressing through the links. In particular, in a cycle-free Soa, all other successors of the source are reachable via some element from A.first().

  • “addEpsilon” on Soa A adds a new vertex labeled ε and an edge from A.src to this new vertex; let UV be the set of all successors of the source which are not in A.first(); for each edge from A.src to an element uU, remove this edge and add an edge from the new ε vertex to u. This is used to be able to treat all successors of the sink equally (in particular, in a cycle-free Soa, this ensures that every node after the source is reachable from the source only by passing through an element from A.first(); this additional vertex makes sure that A.first() is exactly the first layer of the Soa).

  • “exclusive” on Soa A on argument v (a vertex of A) returns the set of all vertices u such that A.src , where Av is defined as the Soa A with the vertex v (and all incident edges) removed. Intuitively, the exclusive set of a vertex v is the set of all vertices which necessarily require v in order to be reached from the source. We will use this method to find sets of vertices to recurse on: whenever a part of the Soa can be exclusively reached via a fixed vertex, we can recurse on this set of exclusive vertices.

  • Finally, the most difficult subroutine is called “bend” and is used to prepare the treatment of strongly connected looped components of the input Soa A. We let U: = A.≻(A.src); with AU we denote the graph which has vertices and edges as in A with all vertices (and incident edges) from U deleted. We let w 1: = A.pred(A.snk) and let w 2 be the set of all vertices dV∖(A.≻(A.src) ∪A.pred(A.snk)) such that there is cw 1 with \(c {\rightarrow ^{+}_{A \setminus U}} d\). We let W: = w 1w 2. Intuitively, W is the set of all elements that can be reached in any number of steps from a predecessor of the sink without crossing a successor of the source. Then the subroutine replaces (bends) all edges from an element in W to a successor of the source by an edge from the same vertex in W to the sink. See Example 28 for an illustration. Note that the application of bend ensures that no element of A.≻(A.src) can be reached from any element of A.pred(A.snk).

Furthermore, we use the following three subroutines for the creation of labels.

  • “plus” on label returns ()+.

  • “concatenate” on labels and returns .

  • “or” on labels and returns | .

The algorithm Soa2Sore is given in Algorithm 2. On a more intuitive level, the algorithm performs the following phases.

  1. (1)

    Recurse on all strongly connected looped components; replace each with a vertex, labeled with the result of the recursion.

  2. (2)

    After the Soa is a directed acyclic graph (DAG), focus on the set F of all vertices which can be reached from the source directly, but not via other vertices; make sure that there are no vertices which can be reached directly and via other vertices (if necessary, add an auxiliary vertex labeled ε).

  3. (3)

    Recurse on the sets of vertices exclusively reachable from a vertex in F and contract these sets to vertices labeled with the result of the recursion.

  4. (4)

    Combine vertices of F with “or,” recurse again on what is exclusively reachable from this new vertex.

  5. (5)

    Once only one item is left in F, merge it with the sink and recurse on the remainder.

figure m

Note that the algorithm introduces ? by way of constructing “or ε.” This can be cleaned up by postprocessing the resulting Sore.

The following theorem states the correctness and the running time of the algorithm.

Theorem 27

The algorithm Soa2Sore, given a Soa A as input, finds a descriptive Sore for \(\mathcal {L}(A)\) in time O(nm), where n is the number of alphabet symbols used in A, and m is the number of transitions in A. Furthermore, this algorithm produces a Sore such that the corresponding Soa has the same strongly connected components as the input Soa, and the same set of successors of the source.

Before we get to the proof of Theorem 27, we give two examples of Soa2Sore. The first example illustrates how strongly connected looped components are treated. The second illustrates the use of “exclusive”.

Example 28

Consider the following Soa.

figure n

The labeled vertices of this Soa consist of a single strongly connected looped component, an application of “bend” computes the set W = {c, d} with w 1 = {c} and w 2 = {d}, which leads to the following Soa.

figure o

After resolving the strongly connected looped component containing a and b (all other are not “looped”) and contract, we get the following.

figure p

We can split off the first vertex twice now (as line 11 applies twice), recursing finally on the remaining Soa as follows.

figure q

This results in d | ε, or, equivalently, d?. Going back through the recursions, we get

$$((ab)^{+}cd?)^{+}. $$

Example 29

Consider now the following Soa.

figure r

For this Soa, line 17 applies and recurses on the upper arc; after contraction, this gives

figure s

which results in (a b) | c as desired (no generalizations were made).

5.2 Proof of Theorem 27

In this section we are concerned with proving Theorem 27. We start with a lemma which is used in its proof. Intuitively, we use the function shown existent in the lemma to turn Sores into a “canonical form”, in order to ease the comparison of the computed Sore with the supposedly smaller descriptive Sore (see the proof for details).

Lemma 30

There is a function f on Sore s such that, for each Sore α, \(\mathcal {L}(f(\alpha )^{+}) = \mathcal {L}(\alpha ^{+}) \setminus \{{\varepsilon }\}\) and, for all aα.≻(α.src) and cα.pred(α.snk) we have .

Proof

We define f as a function on Sores recursively as follows. For all symbols a and Sores α 0, α 1, we let

$$\begin{array}{@{}rcl@{}} f({\varepsilon}) & = & \emptyset;\\ f(a) & = & a;\\ f(\alpha_{0}^{+}) & = & f(\alpha_{0});\\ f(\alpha_{0} \, | \, \alpha_{1}) & = & \left\{\begin{array}{lll} \emptyset, &{if }\mathcal{L}(\alpha_{0}) = \{{\varepsilon}\} = \mathcal{L}(\alpha_{1});\\ f(\alpha_{0}), &{else if }\mathcal{L}(\alpha_{1}) = \{{\varepsilon}\};\\ f(\alpha_{1}), &{else if }\mathcal{L}(\alpha_{0}) = \{{\varepsilon}\};\\ f(\alpha_{0}) \, | \, f(\alpha_{1}), &{otherwise;} \end{array}\right.\\ f(\alpha_{0} \; \cdot \; \alpha_{1}) & = & \left\{\begin{array}{llll} f(\alpha_{0} \, | \, \alpha_{1}), &{if }{\varepsilon} \in \mathcal{L}(\alpha_{0})\cap \mathcal{L}(\alpha_{1});\\ \alpha_{0} \; \cdot \; f(\alpha_{1}), &{else if }{\varepsilon} \in \mathcal{L}(\alpha_{0});\\ f(\alpha_{0}) \; \cdot \; \alpha_{1}, &{else if }{\varepsilon} \in \mathcal{L}(\alpha_{1});\\ \alpha_{0} \; \cdot \; \alpha_{1}, &{otherwise.} \end{array}\right. \end{array} $$

Let a Sore α be given. We omit the straightforward induction which shows \(\mathcal {L}(f(\alpha )^{+}) = \mathcal {L}(\alpha ^{+}) \setminus \{{\varepsilon }\}\).

Let aα.≻(α.src) and cα.pred(α.snk). We show the claim by induction on the syntax tree of f(α). Clearly, the root of f(α) is not labeled+.

Suppose now the root is labeled “or.” Then either a and c are in different subtrees of the root, in which case we have ; or a and c are in the same subtree, in which case the claim follows by induction.

Suppose now the root is labeled with concatenation. We make the following two remarks. If a is in the right subtree of the root, then the left subtree allows ε (as aα.≻(α.src)). Similarly, if c is in the left subtree of the root, then the right subtree allows ε (as cα.pred(α.src)). We consider different cases as follows.

If a and c are both in the left subtree, then the right subtree allows ε, so the claim follows by induction. If a and c are both in the right subtree, then the left subtree allows ε, so the claim follows, again, by induction. If a is in the left subtree, and c in the right, then, trivially, . If c is in the left subtree, and a in the right, then both subtrees allow ε. Thus, the definition of f(α) gives immediately that . □

We are now ready to prove Theorem 27.Proof of Theorem 27 Let a Soa A be given. We proceed by first reasoning about termination and running time. After that, we will inductively show correctness, by assuming all recursive calls to be correct.

Termination and running time

As for the termination, we first note that the algorithm starts by breaking up strongly connected looped components. As remarked at the end of the definition of the “bend” subroutine, the case of line 5 can only apply finitely often, as each “bend” operation breaks up a strongly connected looped component. Line 9 can never apply twice in a row, so it suffices to show that all other cases can only apply finitely often. All later cases contract vertices; this reduces the number of vertices, which can only increase by “addEpsilon”. It is easy to see that for each application of “addEpsilon” at least three vertices are contracted before another application of “addEpsilon”, which shows termination.

We refer to [6] for standard graph algorithms, such as finding strongly connected (looped) components.

As the algorithm never introduces self-loops, the running time on a Soa A is at most the running of A with all self-loops removed plus n. Thus, it suffices to show that Soa2Sore has a running time of O(n m) on self-loop free Soas. Note that we use without further mention that n<m, which implies O(n + m) = O(m).

We first bound the running time on acyclic Soas. We topologically sort the vertices of A (this takes O(m) time). We now iteratively construct an annotation of all the vertices of G with subsets of A.first(), corresponding to what vertices they are reachable from. We start by annotating each vertex of G that corresponds to a vertex vA.first() with {v} and all others with ∅ (in time O(n)). We now iterate through all vertices u from first to last in the topological sort of G and, for each successor w of u, we add to the current annotation of w the annotation of u (assuming unit time for this kind of set operations; overall, this will then take O(m) time). This results in the desired annotation of A, in a total of O(m) time.

Extracting the “exclusive” sets for all elements of A.first() can now be done in O(m) time. From these annotations we can also find a pair of vertices with ⊆-maximal reach-sets in time O(m).

Any two additions of ε-vertices are balanced in between by splitting off of a starting vertex, as given in line 11. As for all other operations, the algorithm can make at most n contractions; hence, there can be only O(n) recursive calls. This results in an overall time of O(n m) for acyclic Soas.

We now turn to the general case. Finding strongly connected looped components takes time O(m), using well-known algorithms, for example Tarjan’s algorithm. Soa2Sore first recurses on all strongly connected looped components, and then on the directed acyclic graph obtained by contracting all strongly connected looped components. The “ bend” operation on a strongly connected looped component splits this component, as no vertex linked to the sink can now reach any of the elements of the “first” set. The running time is maximized when the recursions are as unbalanced as possible; this happens, when each “ bend” operation splits off only one vertex, and the remaining Soa is still strongly connected. This results in splitting off n times, with a time of O(m) for finding strongly connected looped components each time, plus the final work on acyclic Soas.

This shows that the overall running time is O(n m).

Correctness

The statements about strongly connected components and the successors of the source are straightforward: Strongly connected components are only produced by adding+, and that is done exactly on Sores for which the input has a strongly connected components; as for the successors of the source, no case of the algorithm introduces new ones. Furthermore, it is clear that the result is a Sore.

Let a generalized Soa A be given, let A be a copy of A where all labels are replaced with single distinct symbols. Let δ = Soa2Sore(A) and let γ be a Sore such that \(\mathcal {L}(A) \subseteq \mathcal {L}(\gamma ) \subseteq \mathcal {L}(\delta )\).

We argue by induction that \(\mathcal {L}(\delta ) = \mathcal {L}(\gamma )\) (i.e., we assume that Theorem 27 holds for all recursive calls that Soa2Sore makes on A). We distinguish a number of different cases, depending on which clause was used for Soa2Sore(A).

Case 1

: The clause in line 3 or the clause in line 4 was used.

This case is trivial.

Case 2

: The clause in line 5 was used. Let U be as chosen in line 6. Let A 0 = A.extract(U) and B 0 = A.extract(U).bend(); let z be a symbol not in t e r m(A) and B 1 = A.contract(U, z). Let \(\hat {\delta _{0}} =\) Soa2Sore (B 0) and let \(\delta _{0} = \hat {\delta _{0}}^{+}\). We let δ 1 be Soa2Sore (B 1).

Let T be the syntax tree of γ. For each vertex x of T, we call x plussed iff inserting a+ in T at x does not change the language defined by T.

Claim 1.

There is a plussed vertex x in T such that, for the subtree γ 0 rooted at x, we have t e r m(γ 0) = t e r m(U).

Proof of Claim 1. Let u be the plussed vertex furthest down in T such that t e r m(u) contains t e r m(U); such a vertex has to exist in T, as → γ is a superrelation of → A , where U is a strongly connected component.

Let ct e r m(u) and let at e r m(U). Then \(a {\rightarrow ^{+}_{\gamma }} c\) and \(c {\rightarrow ^{+}_{\gamma }} a\); thus, \(a {\rightarrow ^{+}_{\delta }} c\) and \(c {\rightarrow ^{+}_{\delta }} a\), since → δ is a superrelation of → γ . As → δ has the same strongly connected components as → A , and U is the strongly connected component containing a, we get ct e r m(U).

Let f be as shown existent in Lemma 30, and let x be the plussed vertex highest up in T such that t e r m(x) = t e r m(U). Let \(\hat {\gamma _{0}}\) be the subtree of γ rooted at x; let γ 1 be derived from γ by substituting the subtree at x with a leaf labeled z if \({\varepsilon } \not \in \mathcal {L}(\gamma _{0})\) and (z | ε) otherwise. Let \(\gamma _{0} = f(\hat {\gamma _{0}})\). Clearly, it suffices to show that \(\mathcal {L}(\gamma _{0}) = \mathcal {L}(\delta _{0})\) and \(\mathcal {L}(\gamma _{1}) = \mathcal {L}(\delta _{1})\).

Claim 3.

\(\mathcal {L}(B_{1}) \subseteq \mathcal {L}(\gamma _{1}) \subseteq \mathcal {L}(\delta _{1})\).

Proof of Claim 2. In order to avoid unnecessary case distinctions, we first introduce two new and distinct terminal symbols ⊳ and ⊲, where ⊳ is used as a word-start symbol, and ⊲ as a word-end symbol. To this end, we define \(\gamma ^{\prime }_{1}:= \triangleright \gamma _{1} \triangleleft \) (\(\delta ^{\prime }_{1}\), δ , and γ are defined analogously). In addition to this, we define a Soa \(B^{\prime }_{1}\) with \(\mathcal {L}(B^{\prime }_{1})=\triangleright \mathcal {L}(B_{1}) \triangleleft \) and a Soa b with \(\mathcal {L}(B^{\prime })=\triangleright \mathcal {L}(A) \triangleleft \). (This is easily done by inserting new vertices labeled ⊳ or ⊲ between the source and its successors, or the sink and its predecessors, respectively).

We first prove \(\mathcal {L}(B^{\prime }_{1}) \subseteq \mathcal {L}(\gamma ^{\prime }_{1}) \subseteq \mathcal {L}(\delta ^{\prime }_{1})\). After this is established, the claim follows by observing that projection preserves inclusion. \(\underline {\mathcal {L}(B^{\prime }_{1}) \subseteq \mathcal {L}(\gamma ^{\prime }_{1})}:\) Let \(a,b \in term(B^{\prime }_{1}) \setminus \{z\}\) and suppose \(a {\rightarrow _{B^{\prime }_{1}}} b\). We have \(a {\rightarrow _{B^{\prime }}} b\), and, hence, \(a {\rightarrow _{\gamma ^{\prime }}} b\). Then \(a {\rightarrow _{\gamma ^{\prime }_{1}}} b\) follows from the definition of \(\gamma ^{\prime }_{1}\).

Let \(a \in term(B^{\prime }_{1}) \setminus \{z\}\) and suppose \(a {\rightarrow _{B^{\prime }_{1}}} z\). Thus, there is a bt e r m(U) such that \(a {\rightarrow _{B^{\prime }}} b\), and, hence, \(a {\rightarrow _{\gamma ^{\prime }}} b\). Then \(a {\rightarrow _{\gamma ^{\prime }_{1}}} z\) follows from the definition of \(\gamma ^{\prime }_{1}\).

Let \(b \in term(B^{\prime }_{1}) \setminus \{z\}\) and suppose \(z {\rightarrow _{B^{\prime }_{1}}} b\). Thus, there is an at e r m(U) such that \(a {\rightarrow _{B^{\prime }}} b\), and, hence, \(a {\rightarrow _{\gamma ^{\prime }}} b\). Then \(z {\rightarrow _{\gamma ^{\prime }_{1}}} b\) follows from the definition of γ 1.

\(\underline {\mathcal {L}(\gamma ^{\prime }_{1}) \subseteq \mathcal {L}(\delta ^{\prime }_{1}):}\) Let \(a,b \in term(\gamma ^{\prime }_{1}) \setminus \{z\}\) and suppose \(a {\rightarrow _{\gamma ^{\prime }_{1}}} b\). From the definition of \(\gamma ^{\prime }_{1}\) it is now easy to see that \(a {\rightarrow _{\gamma ^{\prime }}} b\), and, hence, \(a {\rightarrow _{\delta ^{\prime }}} b\). Thus, we get \(a {\rightarrow _{\delta ^{\prime }_{1}}} b\).

Let \(a \in term(\gamma ^{\prime }_{1}) \setminus \{z\}\) and suppose \(a {\rightarrow _{\gamma ^{\prime }_{1}}} z\). Thus, there is a bt e r m(U) such that \(a {\rightarrow _{\gamma ^{\prime }}} b\), and, hence, \(a {\rightarrow _{\delta ^{\prime }}} b\). We have now \(a {\rightarrow _{\delta ^{\prime }_{1}}} z\).

Let \(b \in term(\gamma ^{\prime }_{1}) \setminus \{z\}\) and suppose \(z {\rightarrow _{\gamma ^{\prime }_{1}}} b\). Thus, there is an at e r m(U) such that \(a {\rightarrow _{\gamma ^{\prime }}} b\), and, hence, \(a {\rightarrow _{\delta ^{\prime }}} b\). We have now \(z {\rightarrow _{\delta ^{\prime }_{1}}} b\).

Hence, \(\mathcal {L}(B^{\prime }_{1}) \subseteq \mathcal {L}(\gamma ^{\prime }_{1}) \subseteq \mathcal {L}(\delta ^{\prime }_{1})\), which is equivalent to \(\triangleright \mathcal {L}(B_{1})\triangleleft \subseteq \triangleright \mathcal {L}(\gamma _{1})\triangleleft \subseteq \triangleright \mathcal {L}(\delta _{1}) \triangleleft \). As inclusion is preserved under projection, this implies \({\pi _{T}}(\mathcal {L}(B^{\prime }_{1})) \subseteq {\pi _{T}}(\mathcal {L}(\gamma ^{\prime }_{1})) \subseteq {\pi _{T}}(\mathcal {L}(\delta ^{\prime }_{1}))\) which proves the claim (for T:=Σ∖{⊳,⊲}).

Thanks to the claim we can now apply the induction hypothesis to see that \(\mathcal {L}(\gamma _{1}) = \mathcal {L}(\delta _{1})\).

Similarly, we now show γ 0 and δ 0 to be equivalent by showing \(\mathcal {L}(B_{0}) \subseteq \mathcal {L}(\gamma _{0}) \subseteq \mathcal {L}(\delta _{0})\). From the induction hypothesis we know that B 0.≻(B 0.src) = δ 0.≻ (δ 0.src); this shows that γ 0.≻(γ 0.src) has to coincide with these sets. In particular, we have now

$$ \gamma_{0}.\text{succ}(\gamma_{0}.{\mathtt{src}}) = B_{0}.\text{succ}(B_{0}.{\mathtt{src}}) = A_{0}.\text{succ}(A_{0}.{\mathtt{src}}). $$
(1)

Claim 3.

We have that

$$B_{0}.\text{pred}(B_{0}.{\texttt{snk}}) \subseteq \gamma_{0}.\text{pred}(\gamma_{0}.{\texttt{snk}}) \subseteq \delta_{0}.\text{pred}(\delta_{0}.{\mathtt{snk}}). $$

Proof of Claim 3. The statement γ 0.pred(γ 0.snk) ⊆δ 0.pred(δ 0.snk) follows straightforwardly from the choice of γ 0.

Let cB 0.pred(B 0.snk). Suppose first that there is an element dt e r m(A)∖t e r m(U) such that c A d. Then c γ d, and, thus, cγ 0.pred(γ 0.snk). Therefore, A 0.pred(A 0.snk) ⊆γ 0.pred(γ 0.snk).

Suppose now, by way of contradiction, that there is a b with

$$ b\in B_{0}.\text{pred}(B_{0}.{\mathtt{snk}}) \; \wedge \; b\notin \gamma_{0}.\text{pred}(\gamma_{0}.{\mathtt{snk}}). $$
(2)

Then bA 0.pred(A 0.snk). Hence, the edge from b to B 0.snk was added by the bend routine, and we know that bw 2 must hold. Therefore,

$$ b\not\in A_{0}.\text{pred}(A_{0}.{\mathtt{snk}}) \; \wedge \; b\not\in A_{0}.\succ(A_{0}.{\mathtt{src}}). $$
(3)

Furthermore, there exist an a with

$$ a \in A_{0}.\text{succ}(A_{0}.{\mathtt{src}}) \; \wedge \; b{\rightarrow_{A_{0}}} a $$
(4)

and a c such that

$$ b \in B_{0}.\text{pred}(B_{0}.{\mathtt{src}}) $$
(5)

with the property (∗): b can be reached from c without crossing any elements of A 0.succ(A 0.src).

Let T 0 be the syntax tree of γ 0, and let x be the lowest vertex below which all three letters a, b, and c occur. Let \(\gamma ^{\prime }_{0}\) be the subexpression of γ 0 that corresponds to the subtree that has x as root. Due to the choice of x , this vertex must be labeled with a binary operator, which implies that it has a left and right subtree, to which we refer as T L and T R , respectively. We call the corresponding expressions γ L and γ R . Hence, \(\gamma ^{\prime }_{0}\) is either (γ L γ R ), or (γ L | γ R ). The following reasoning applies to both of these two cases.

As c is a predecessor of the sink in γ 0, we know that \(\gamma ^{\prime }_{0}.\text {pred}(\gamma ^{\prime }_{0}.\) snk) ⊆γ 0.pred(γ 0.snk) must hold. This implies γ R .pred(γ R .snk) ⊆γ 0.pred(γ 0.snk). Furthermore, if c occurs in γ L , then γ L .pred(γ L .snk) ⊆γ 0.pred(γ 0.snk) must hold as well. Analogously, we can observe that γ L .≻(γ L .src) ⊆γ 0.≻(γ 0.src) holds; and if a occurs in γ R , then γ R .≻(γ R .src) ⊆γ 0.≻(γ 0.src). We conclude the proof of Claim 2 with the following case analysis.

Case 1

: b and c occur in the same subexpression γ ∈ {γ L , γ R }. As x is the lowest vertex above the three letters a, b, c, the letter a cannot occur in γ . Hence, bγ .pred(γ .snk) must hold in order to allow b γ a (as given by (4)). But as c occurs in γ , this implies bγ 0.pred(γ 0.snk), in contradiction to (2).

Case 2

: b occurs in γ L and c in γ R . We have that b is only reachable from c if b is an element of γ L .≻(γ L .src), or by crossing an element of that set. But because of

$$\gamma_{L}.\text{succ}(\gamma_{L}.{\mathtt{src}})\subseteq \gamma_{0}.\text{succ}(\gamma_{0}.{\mathtt{src}}) \underset{{\scriptstyle (1)}}{=} A_{0}.\text{succ}(A_{0}.{\mathtt{src}}), $$

the former contradicts (3) and the latter cannot happen due to (∗).

Case 3

: b occurs in γ R and c in γ L . If a occurs in γ R , we can observe that the path from c to b (given by (∗)) must cross an element of A 0.succ(A 0.src), a contradiction similar to the previous case. But if a occurs in γ L , then bγ R .pred(γ R .snk) ⊆γ 0.pred(γ 0.snk) follows from b A a given by (4), a contradiction to (2).

Lastly, we turn to pairs of elements from t e r m(U). Claim 4. On t e r m(U), \({\rightarrow _{B_{0}}}\) is a subrelation of \({\rightarrow _{\gamma _{0}}}\), which in turn is a subrelation of \({\rightarrow _{\delta _{0}}}\).Proof of Claim 4 This is straightforward, using the properties of f taken from Lemma 30.

This finishes showing \(\mathcal {L}(B_{0}) \subseteq \mathcal {L}(\gamma _{0}) \subseteq \mathcal {L}(\delta _{0})\); thus, using the induction hypothesis, \(\mathcal {L}(\gamma _{0}) = \mathcal {L}(\delta _{0})\). This finishes the reasoning for this case.

Case 3

: The clause in line 9 was used. This case is trivial from the induction hypothesis, as the language is not changed by the addEpsilon() method.

Case 4

: The clause in line 11 was used.

Let v be the only successor of A.src; let a = v.label(). Note that a is the only successor of γ.src. Let U = t e r m(δ)∖{a}. As A does not have a strongly connected looped component, neither does Soa(γ); thus, we have \(\mathcal {L}(\gamma ) = a \cdot \pi _{U}(\mathcal {L}(\gamma ))\). Let γ equal γ with a replaced by ε and δ =Soa2Sore(A.extract(U)). Then we have \(\mathcal {L}(A.\text {extract}(U)) \subseteq \mathcal {L}(\gamma ^{\prime }) \subseteq \mathcal {L}(\delta ^{\prime })\) and the claim follows by induction.

Case 5

: The clause in line 17 was used. We now know that A is cycle free and, thus, δ does not contain a “+”. Therefore, without loss of generality, γ does not contain a “+” either (the only+ could be on ε or other terminal-free parts, which is unnecessary).

Let v be as chosen in line 17 and a = v.label(). Let U = A.exclusive(v).

Let B 0 = A.extract(U); let z be a symbol not in t e r m(A) and B 1 = A.contract(U, z). Let δ 0 = Soa2Sore (B 0) and let δ 1 = Soa2Sore (B 1). By the induction hypothesis, we have that δ 0.first()={a}. Thus, any word in \(\mathcal {L}(\gamma ) \subseteq \mathcal {L}(\delta )\) that contains an element of U has to start with an a.

Claim 5.

There is a subtree γ 0 of γ such that t e r m(γ 0) = U.

Proof of Claim 5. Let γ 0 be the smallest subtree such that Ut e r m(γ 0). Suppose, by way of contradiction, there is bt e r m(γ 0)∖U. By the definition of U, we have that there is bγ 0.first()∖U. From aγ.first() and A.first() = A.≻(A.src) we get bγ.first(), and, thus, a and b cannot appear in the same word of \(\mathcal {L}(\gamma )\). Thus, there is a subtree β of the syntax tree of γ 0 where the root is labeled with “or” and a and b are in different subtrees. In the child tree containing b there cannot be any elements of U, since all elements of U are reachable from a.

Thus, β cannot be all of γ 0, as γ 0 was chosen smallest. β cannot descend from the left child of γ 0, as then all elements of U in the right subtree are reachable via b (or γ 0 not smallest); similarly, β cannot descend from the right child of γ 0, as then all elements of U in the left subtree are not reachable from a (or γ 0 not smallest).

Let γ 0 be a subtree of γ such that t e r m(γ 0) = U; let γ 1 be derived from γ by substituting the γ 0 with a leaf labeled z. Note that \(\varepsilon \not \in \mathcal {L}(\gamma _{0})\) because of A.≻ (A.snk) = A. first().

We now clearly get \(\mathcal {L}(B_{0}) \subseteq \mathcal {L}(\gamma _{0}) \subseteq \mathcal {L}(\delta _{0})\) and \(\mathcal {L}(B_{1}) \subseteq \mathcal {L}(\gamma _{1}) \subseteq \mathcal {L}(\delta _{1})\). Thus, this case follows from the induction hypothesis, similarly to Case 2.

Case 6

: The clause in line 21 was used.

In this case we know that |A.first()|>1, as no other case applies. Furthermore, we will use without mention that A is cycle free.

Let u, v as chosen in line 21. Let z be a symbol not in t e r m(A). Let B = A.contract({u, v}, z). Let δ 0 be Soa2Sore(B).

Let a = u.label() and b = v.label(). From u, vδ.first() we have that there is a subtree β of γ with “or” at the root and a and b are in different child trees.

Claim 6

:\(\mathcal {L}(\beta )\) is a set of letters.

Proof of Claim 6 Suppose, by way of contradiction, there is a word \(w \in \mathcal {L}(\beta )\) of length ≠1. From A.first() = A.succ(A.src) we get wε; thus, the length of w is >1. Let c be the last symbol of w. As Case 5 does not apply, we have that c is reachable from two different elements of A.first(); let d 0, d 1 be two such elements.

Clearly, d 0 and d 1 are in the same subtree of β; without loss of generality, suppose they are in the same subtree as a. Thus, everything that is reachable in A from both a and b is also reachable from d 0 and d 1; furthermore, c is reachable in A from both d 0 and d 1 but not reachable from b (as c is not in the same subtree β as b). This is a contradiction to the minimality of u, v.

From the claim we get, without loss of generality, that (a | b) is a subexpression of γ; thus, β = (a | b). Let γ 0 be derived from γ by substituting β with z. Clearly, we now have \(\mathcal {L}(B) \subseteq \mathcal {L}(\gamma _{0}) \subseteq \mathcal {L}(\delta _{0})\). From the induction hypothesis we get \(\mathcal {L}(\gamma _{0}) = \mathcal {L}(\delta _{0})\); thus, \(\mathcal {L}(\gamma ) = \mathcal {L}(\delta )\).

6 Beyond Single Occurrences

This section examines two other aspects of the descriptive generalization algorithms in the present paper. In Section 6.1, we consider a possible extension to restricted regular expressions that are not limited to a single number of occurrences of each terminal letter. Section 6.2 briefly discusses the descriptive generalization of language classes that are generated by mechanisms which are more powerful than Soas, but use Soas, Sores, or Chares as hypotheses.

6.1 Learning k-Ores

While Sores and Soas allow only a single occurrence of each terminal letter, Bex et al. [3] introduced the more general concepts of k-occurrence regular expressions (k-Ores) and k-occurrence automata (k-Oas). As might be expected, a k-Ore is a regular expression where every terminal symbol occurs at most k times. Analogously, while a Soa has only a single state for each terminal letter a, a k-Oa allows up to k states A (1),…, A (k), where k ≥ 1. Hence, Soas are 1-Oas, and Sores are 1-Ores.

A k-Oa is called non-deterministic if it has a state that has two successor states with identical labels, and a k-Ore is called non-deterministic if its canonical k-Oa (as defined in the next paragraph) is non-deterministic. Unlike Soas and Sores, which are deterministic by definition, the same does not hold for k-Oas and k-Ores (for further details, see [3]).

Similarly to the way Sores can be translated into Soas, any given k-Ore can be converted into a canonical k-Oa: Given a k-Ore α over some alphabet Σ, we transform α into a Sore α (k) over the marked alphabet Σ(k), which is defined by

$${\Sigma}^{(k)}:= \{a^{(i)}\mid 1\leq i \leq k\}.$$

In other words, every occurrence of some letter a is replaced with an occurrence that is marked with some number (the exact value of each of the marking i is irrelevant to our purposes, as long as every A (i) occurs at most once). We then transform α (k) into a Soa \(\mathcal {A}^{(k)}:= \textsc {Soa}(\alpha ^{(k)})\), and obtain the k-Oa \(\mathcal {A}\) over the alphabet Σ by stripping the markings from the letters in \(\mathcal {A}^{(k)}\) (i. e., every A (i) is replaced with a). Note that \(\mathcal {A}\) does not depend on the choice of the marking. More importantly, we observe that \(\mathcal {L}(\mathcal {A})=\mathcal {L}(\alpha )\), as \(\mathcal {L}(\mathcal {A}^{(k)})=\mathcal {L}(\alpha ^{(k)})\).

Furthermore, note that the characteristic inclusion criterion for Sores (and Soas) becomes merely sufficient for deterministic k-Ores (and deterministic k-Oas). This is easily seen by considering the deterministic 2-Ore α:=(a c | b c) and the deterministic Sore (and, hence, also 2-Ore) β:=(a | b | c). While \(\mathcal {L}(\alpha )\subseteq \mathcal {L}(\beta )\) holds, the canonical OA of β does not cover the canonical OA of α.

Bex et al. propose an algorithm RWR 2 that transforms k-Oas into k-Ores. This algorithm can be paraphrased as follows: First, the input k-Oa \(\mathcal {A}\) is transformed into a Soa \(\mathcal {A}^{(k)}\) over the marked alphabet Σ(k). Then RWR \(^{2}_{1}\) (i. e., RWR \(^{2}_{\ell }\) with = 1) is used to compute a Sore α (k) (also over Σ(k)) for this Soa. In the last step, the markings are stripped from α (k). The resulting k-Ore is called RWR \(^{2}{\mathcal {A}}\).

Although \(\mathcal {L}\)(RWR \(^{2}(\mathcal {A}))\supseteq \mathcal {L}(\mathcal {A})\) holds, two problems might occur. Firstly, even if \(\mathcal {A}\) is deterministic, RWR \(^{2}\mathcal {A}\) is not necessarily deterministic; and secondly, even if \(\mathcal {L}(\mathcal {A})\) is a k-Ore language, \(\mathcal {L}({\mathtt {RWR}}^{2}{\mathcal {A}})=\mathcal {L}(\mathcal {A})\) is not guaranteed (cf. Section 4.2 of [3]).

Nonetheless, for a large class of k-Oas, the transformation does not change the language, as shown just below (for the definition of Glushkov representations, see [3]).

Theorem 31 (Bex et al. [3])

If a k-Oa \(\mathcal {A}\) is a Glushkov representation of a target k- Ore α, then RWR \(^{2}(\mathcal {A})\) is equivalent to α. Moreover, if α is deterministic, then so is RWR \(^{2}{\mathcal {A}}\).

Due to this result, it is possible to use RWR 2 as a subroutine of a k-Ore inference algorithm called iDREGEX. Ignoring more technical aspects that are not relevant to the present paper, iDREGEX first infers k-Oas and then uses RWR 2 to convert these into k-Ores (for a chosen k).

It is natural to ask what happens if the call of RWR \(^{2}_{1}\) in RWR 2 is replaced with a call of Soa2Sore. We refer to the resulting algorithm as Koa2Kore. In other words, given a k-Oa \(\mathcal {A}\), Koa2Kore \({\mathcal {A}}\) is defined as the result of applying Soa2Sore to \(\mathcal {A}^{(k)}\) and then stripping the markings from Soa2Sore \((\mathcal {A}^{(k)})\).

We first observe that, similar to RWR 2, Koa2Kore neither preserves determinism, nor does it guarantee descriptivity.

Theorem 32

There exist deterministic k- Oa s \(\mathcal {A}_{1},\mathcal {A}_{2}\) such that:

  1. 1.

    Koa2Kore \((\mathcal {A}_{1})\) is not deterministic, and

  2. 2.

    Koa2Kore \((\mathcal {A}_{2})\) is not \(\mathcal {D}\) -descriptive of \(\mathcal {L}(\mathcal {A}_{2})\) , where \(\mathcal {D}\) is the class of deterministic k-Ore s.

Proof

We begin with the first claim and define \(\mathcal {A}_{1}\) to be the following 3-Oa.

figure x

It is easily seen that \(\mathcal {A}_{1}\) is deterministic. By marking the letters in \(\mathcal {A}_{1}\), we obtain the following Soa \(\mathcal {A}_{1}^{(k)}\).

figure y

Note that any other possible marking of the occurrences of a would suffice for our purpose. On this Soa, Soa2Sore returns the Sore A (1)?(b A (2)) A (3), which, after stripping the markings, yields the 3-Ore a?(b a) a. This 3-Ore is not deterministic, which is easily seen by considering its canonical 3-Oa (the single new edge is marked red).

figure z

To prove the second claim, we define the 2-Oa \(\mathcal {A}_{2}\) as follows (this automaton is also used in [3] (Section 4.2) to prove that \(\mathcal {L}({\mathtt {RWR}}^{2}{\mathcal {A}})\neq \mathcal {L}(\mathcal {A})\) can hold).

figure aa

As pointed out in [3], \(\mathcal {L}(\mathcal {A}_{2})\) is generated by the deterministic 2-Ore δ: = b c?a(b a). By applying Soa2Sore to any of the two possible marked version of \(\mathcal {A}_{2}\), we obtain the deterministic 2-Ore α 2: = b c?(a b?)+. We observe that,

$$\begin{array}{@{}rcl@{}} \mathcal{L}(\alpha_{2}) &&=\mathcal{L}(b c? (ab?)^{+})\\ &&=\mathcal{L}(b c? (ab?)(ab?)^{*})\\ &&=\mathcal{L}(b c? a(b?a)^{*}b?)\\ &&\supset \mathcal{L}(b c? a(ba)^{*})=\mathcal{L}(\delta)=\mathcal{L}(\mathcal{A}_{2}) \end{array} $$

holds, which means that α 2 is not descriptive of \(\mathcal {L}(\mathcal {A}_{2})\). □

On a more positive side, an analogous result to Theorem 31 holds as well.

Theorem 33

If a k-Oa \(\mathcal {A}\) is a Glushkov representation of a target k-Ore α, then Koa2Kore \((\mathcal {A})\) is equivalent to α. Moreover, if α is deterministic, then so is Koa2Kore \({\mathcal {A}}\).

The proof is analogous to the proof of Theorem 4.8 in [3] (mutatis mutandis).

While Theorem 32 demonstrates that Soa2Sore cannot be generalized into an algorithm for descriptive k-Ores (at least not in a straightforward manner), Theorem 33 shows that that Koa2Kore can be used as a replacement of RWR 2 in the algorithm iDREGEX in [3].

Although iDREGEX can reliably identify target languages that are k-Ore-languages, Theorme 32 also proves that this replacement of RWR 2 does not lead to an algorithm for descriptive generalization with respect to deterministic k-Ores. In order to solve this problem, one would first need to properly extend Soa2Sore to k-Ores. Accordingly, the authors wish to highlight the following open problem.

Question 1

Is there an efficient algorithm that, given a deterministic k-Oa \(\mathcal {A}\), computes a deterministic k-Ore that is descriptive of \(\mathcal {L}(\mathcal {A})\) (w. r. t. the class of deterministic k-Ores)?

We briefly discuss a possible approach to this problem in Section 8.

As the k-Oa inference step in iDREGEX does not guarantee descriptivity, the following question is probably of equal importance:

Question 2

Is there an efficient algorithm that, given a finite language S, computes a deterministic k-Oa that is descriptive of S (w. r. t. the class of deterministic k-Oas)?

6.2 Approximation of Larger Language Classes

According to Bex et al. [2], a common difficulty in the creation of XML Schema Definitions is that many non-expert users struggle with the distinction between a regular expression and a deterministic regular expression (while this is not explicitly mentioned, the same reasoning also applies to DTDs). One potential solution that is examined in [2] is taking a user-specified non-deterministic regular expression α and transforming it into a deterministic regular expression δ such that (in the terminology of the present paper) δ is \(\mathcal {D}\)-descriptive of \(\mathcal {L}(\alpha )\), where \(\mathcal {D}\) is chosen to be the full class of deterministic regular expressions. As shown in Theorem 7 in [2], this choice of \(\mathcal {D}\) is too large; as there are languages that do not have a descriptive deterministic regular expression.

In contrast to this, the main results of the present paper show that this approach is viable if one is willing to use Chares or Sores instead of the full class of deterministic regular expressions. In fact, not only can one compute descriptive Chares or Sores from non-deterministic regular expression, but from any class of language representation for which one can compute the descriptive Soa from the description of a language L. This includes a wide range of comparatively powerful classes of language description mechanisms (e. g., the class of pushdown automata, or the class of context-free grammars – cf. Hopcroft and Ullman [14], or almost any other introductory textbook). While it might not always be obvious whether these sets can be computed for some given class of descriptors, the following sufficient criterion might serve as first guidance.

Theorem 34

Let \(\mathcal {D}\) be a class of language description mechanisms. Then \(\textsc {Soa}(\mathcal {L}(\delta ))\) can be computed for every \(\delta \in \mathcal {D}\) if there is an algorithm that, given any \(\delta \in \mathcal {D}\) and any regular language R, decides whether \(\mathcal {L}(\delta )\cap R=\emptyset \) holds.

Proof

As the terminal alphabet of every \(\delta \in \mathcal {D}\) is fixed, one can simply construct the regular languages for each possible first letter, each possible last letter, and each possible combination of 2-factors, and check whether these occur in \(\mathcal {L}(\delta )\). Then \(\textsc {Soa}(\mathcal {L}(\delta ))\) can be constructed according to Corollary 7 by adding the appropriate edge for each language where the intersection with \(\mathcal {L}(\delta )\) is non-empty. □

As an alternative to the condition from Theorem 34, one can require that \(\mathcal {D}\) is effectively closed under intersection with regular languages (i. e., that a description for \(\mathcal {L}(\delta )\cap R\) not only exists, but can be computed) and that the emptiness problem for \(\mathcal {D}\) is decidable. As this implies that \(\mathcal {L}(\delta )\cap R=\emptyset \) is decidable, Theorem 34 applies.

Of course, this approach is not without drawbacks. Considering the difference in expressive power, a Sore (or Chare) that is descriptive of a context-free language L might only be a very rough approximation of L. But in addition to this, as the following theorem shows, it is not even possible to decide whether the descriptive expression generates the same language.

Theorem 35

For any arbitrary CFG G, the three following questions are undecidable.

  1. (1)

    Is \(\mathcal {L}(G)\) a Soa language?

  2. (2)

    Is \(\mathcal {L}(G)\) a Sore language?

  3. (3)

    Is \(\mathcal {L}(G)\) a Chare language?

Proof

We proof the theorem for all three cases at once. This proof is a slight modification of the proof of Theorem 8.11 in Hopcroft and Ullman [14]Footnote 5 for the undecidability of the question whether \(\mathcal {L}(G)={\Sigma }^{*}\) holds for an arbitrary CFG G.

In that proof, Hopcroft and Ullman show that, given an arbitrary Turing machine M, one can effectively construct a CFG G M with terminal alphabet Σ=Γ∪Q∪{#} such that \(\mathcal {L}(G_{M})={\Sigma }^{*}\) holds if and only if \(\mathcal {L}(M)=\emptyset \). In particular, their construction defines \(\mathcal {L}(G_{M})\) to be the set of invalid computations of M, which we shall refer to as I M . Basically, I M is the set of all strings that do not encode accepting runs of M (for the exact definition, see [14], Chapter 8.6).

By the definition of invalid computations, f i r s t(I M ) = l a s t(I M )=Σ and g r a m 2(I M )=Σ2 must hold. Hence, Soa(I M )=Σ holds for every Turing machine M. Accordingly, I M is a Soa language if and only if I M = Σ (otherwise, we would arrive at the contradictory observation that Soa(I M )=Σ is not Soa-descriptive of I M ).

Therefore, given an arbitrary Turing machine M, one can effectively construct a CFG G M such that \(\mathcal {L}(G_{M})\) is a Soa language if and only if \(\mathcal {L}(M)=\emptyset \). The question whether \(\mathcal {L}(M)=\emptyset \) holds for an arbitrary Turing machine M is undecidable (again, cf. [14]); hence, the claim follows for Soa languages.

As a descriptive Sore (or a descriptive Chare) cannot be less general than the descriptive Soa, the claim immediately follows for Sore languages and Chare languages as well. □

Theorem 35 shows that, while it is possible to transform CFGs into each of Soas, Sores, and Chares with the guarantee of a minimal generalization, it is not possible to tell whether this step causes a proper generalization (this happens if the original language is not a Soa-, Sore-, or Chare language), or whether the language remains unchanged (if the original language is expressible in the respective model). Hence, it is not only impossible to decide how much information is lost during the transformation (or how many new words are introduced), but also whether there is any loss of information at all.

Note that this result can be adapted to all those language description mechanisms that can express I M , or similarly constructed encodings of invalid computations of Turing machines.

7 Example DTDs

This section contains some example element type declarations that were obtained by running a prototype implementation of Soa2Chare and Soa2Sore Footnote 6 against a sample XML database, as well as a comparison to the declarations from the original DTD. These examples illustrate what kind of DTDs the algorithms generate, and what insights they might offer into the analyzed data.

The algorithms were tested against the version of the Mondial database [17] that, according to the website, has been revised in summer 2009 (the corresponding DTD states a revision date of April 2009). Note that this version of Mondial considerably differs from the older version provided by [18], which was used for the experimental evaluation by Bex et al. [4]. Most importantly, the XML file and the DTD from [17] are consistent with the data, which is not the case for [18] (as already pointed out by [4]).

First, note that with the single exception of country, all element type declarations in the Mondial DTD are Chares; and all are Sores. While it would have been interesting to examine the generalization process on an example where the provided DTD contains a declaration that is not a Sore, all examples that the authors were able to locate contained only single occurrences of element names.

As most of the other element type declarations are trivial, the test focussed on the following elements (listed with their respective number of occurrences in the XML file): city (3261), country (241), desert (62), estuary (220), island (281), lake (132), mountain (242), organization (152), province (1531),river (221), sea (40), and source (216).

The element province is the only of the examined elements for which the both computed Chare and the computed Sore are identical to the original declaration:

figure ab

For each of the elements estuary, mountain, organization, sea, and source, the computed Chare is identical to the computed Sore, which in turn is less general than the definition in the DTD. In each of the cases, the only difference between is that elements that are marked as optional in the DTD either always occur in the XML file, or do not occur at all (which means that in the former case, ? is omitted or > is replaced with +, and in the latter case, optional elements from the original DTD do not appear in the inferred element type declaration).

The situation is similar, but more interesting, for the elements island andriver. Here, the inferred Chares are identical to the declaration in the DTD, but each of the inferred Sores is less general. As an example, consider the element island.

  • original declaration in DTD and inferred Chare: name,islands?,located⋆,area?,elevation?,longitude?, latitude?

  • inferred Sore: name,islands?,located⋆,area?,elevation?, (longitude,latitude)?

In the original declaration, longitude and latitude are both optional. But as evidenced by the descriptive Sore, the two elements are not used independently – neither of them can occur without the other. Unlike Chares, Sores are able to express such dependencies.

The results for the elements city, desert, and lake can be understood as a combination of the previous phenomena. Due to removal of options, the Chare is less general than the original declaration; and the Sore is even less general due to dependencies as in the previous example.

The only case where the language that is generated by the inferred descriptive Chare is incomparable to the language from the original declaration is for the element country. As already mentioned above, this declaration is the only declaration in the DTD that is not a Chare.

figure ac

It is easily seen that the language that is described by this expression is not a Chare language, due to the subexpression (province+|city+) .

Of course, the presence of such a non-Chare expression in the DTD does not mean that it is necessary to describe the actual data in the XML file. For example, it might be possible that this subexpression is too general, and that the less general subexpression (province|city) is a better description of the data. But this is not the case, as the following descriptive Chare shows.

figure ad

Apparently, the non-Chare subexpression is not too general for the actual data, as the descriptive Chare replaces it with the more general expression province⋆,city⋆.

The only reason that the language of the inferred Chare is incomparable to one of the original declaration (instead of being strictly more general) is the subexpression encompassed+ instead of encompassed⋆.

In contrast to this, the inferred descriptive Sore is less general than the declaration from the DTD:

figure ae

Three subexpressions of this Sore are particularly noteworthy. First, note that (province+|city+) is present, as in the original declaration. Second, with (population_growth,infant_mortality?)?, we have a dependency in the actual XML data that was not expressed in the DTD (similar to longitude and latitude in the previous example): The elements population_growth and infant_mortality are both optional, but the latter never appears without the former. Finally, the subexpression (gdp_total,gdp_agri?,(gdp_ind,gdp_serv)?)? is another case of such a dependency.

Although the Mondial XML file might be considered comparatively small, and its DTD rather simple, the examples in the present section should provide some insights into the expressive power of Chares and Sores. In particular, these examples illustrate that Sores are able to express a certain kind of dependency for optional elements. Nonetheless, it should be mentioned that in order to be Sore-expressible, this dependency has to be local (e. g., as for (longitude,latitude)?).

In order to illustrate such an inexpressible dependency, consider the sample S:={a b c, b}. While a is present if and only if c is present, the two letters are always separated by b. Using Soa2Sore on Soa(S) yields the descriptive Sore a?b c?, which does not express this dependency (and is, in fact, a Chare).

8 Conclusions and Further Work

This paper introduces algorithms for inferring descriptive Sores and descriptive Chares: First, use 2T-INF to compute a descriptive Soa, then use Soa2Sore or Soa2Chare to turn this automaton into a Sore or a Chare.

In [4], Bex et al. state that their schema inference algorithms “outperform existing algorithms in accuracy, conciseness, and speed”. Considering the results presented in Sections 3 to 5, the authors of the present paper feel confident to suggest that their new strategies outperform the algorithms from [4] at least with respect to both accuracy and speed. In order to examine the potential practical value of these results, an extensive experimental evaluation of the algorithms would be very interesting. This would also give the opportunity to evaluate the quality of the results of the algorithms, for example with respect to different conciseness measures or how well they describe the target language.

We now discuss possible extensions, and possible directions for further work. In order to overcome the problem that Sores and Chares cannot count (beyond the trivial case of distinguishing between 0 and 1), Bex et al. [4] (Section 8) propose extending these models with numerical predicates; i. e., one could write a ≥ 1,≤3, with \(\mathcal {L}(a^{\geq 1,\leq 3})=\{a,aa,aaa\}\). With an additional post-processing step, the algorithms in [4] can be used to infer Chares and Sores that are extended with counting. This extension can also be adapted to the approaches in the present paper. Basically, one replaces each+ or+? with appropriate bound that describes how often the expression under the+ is repeated; e. g., the sample {a, a a a} would lead to descriptive expression a +, additional post-processing would turn this into a ≥ 1,≤3. Note that, in the form described in [4], this approach can only learn finite languages, as positive data does not allow do distinguish between a + and a n for sufficiently large n. But this can be fixed by providing the post-processing algorithm with an additional threshold t, and removing those upper bounds ≤ n for which nt with unbounded; e. g., a ≥ 2, ≤ n with nt would become a ≥ 2.

On the topic of probabilistic learning, if one is willing to fix a set of probability distributions on the sample space, the learning algorithms could be adapted to feature a variant of stochastic finite learning (introduced by Rossmanith and Zeugmann [20]). It might be possible to derive algorithms which, with high probability, give descriptive generalizations from a very small set of (randomly chosen) examples. This could lead to inference algorithms that do not need to process the whole input, but only a random subset, which might be interesting for very large datasets.

From the authors’ point of view, Questions 1 and 2 (cf. Section 6.1) are the most interesting. In other words: Is it possible to extend the inference algorithms discussed in the present paper from Soas and Sores to deterministic k-Oas and deterministic k-Ores? It seems that one would need to develop not only a good generalization of Soas, but also a “good” inclusion criterion, preferably syntactic. This idea is based on the following observation: While the results in the present paper make no direct use of the results and techniques that Freydenberger and Reidenbach [10] developed for descriptive generalization of pattern languages, both papers rely heavily on the fact that the inclusion problem for the respective language classes has a syntactic criterion for inclusion.

The proofs on descriptive generalization of pattern languages in [10] rely on the fact that inclusion for terminal-free E-pattern languages is characterized by the existence of a morphism which maps the pattern that generates the superlanguage to the pattern that generates the sublanguage. This criterion is a versatile tool to prove the nonexistence of a (pattern) language between the target language and the language of a descriptive pattern. While the proofs of the present paper cannot make any direct use of the proofs from [10], the approaches are similar conceptually. In particular, the line of reasoning in which the correctness proofs of Soa2Chare and Soa2Sore use the fact that the inclusion problem for Sores (and Chares) is characterized by the covering of the respective Soas is structurally similar to the proofs for pattern languages.

Moreover, although deciding whether such a pattern morphism exists is NP-complete, the techniques in [10] are not affected by the computational hardness. Hence, the hardness results on the decidability of the k-Ore-inclusion problem presented by Martens et al. [15] do not exclude the existence of such a criterion. This leaves room for hope that Soa2Sore can be extended to k-Ores with k ≥ 2.