
1 Introduction

Regular expressions are widely used in information extraction, network security, database management, programming languages, etc. Nowadays, mining potential knowledge from sequence data has become a common task in many research areas and application scenarios [9, 20, 24, 27]. Techniques for learning regular expressions have therefore attracted increasing attention. For example, many XML documents are not accompanied by a schema, or by a valid one [1, 4, 5, 23]; learning regular expressions from XML documents facilitates diverse applications of XML Schema, such as data processing, automatic data integration, and static analysis of transformations [10, 21, 22]. In this paper, we focus on learning regular expressions from XML documents.

Gold showed that the class of regular expressions cannot be learned from positive data alone [15], and Bex et al. showed that even the class of deterministic regular expressions cannot be learned [3]. Therefore, many works focus on learning subclasses of deterministic regular expressions [2, 3, 6, 7, 11, 12]. Deterministic regular expressions [8] require that each symbol in the input word can be unambiguously matched to a position in the regular expression without looking ahead in the word. Single-occurrence regular expressions (SOREs) [6, 7] are a classic subclass of (standard) deterministic regular expressions. However, SOREs do not support counting, an extension of standard regular expressions used in XML Schema [14, 16,17,18,19, 25, 26]. We therefore propose a restricted subclass of single-occurrence regular expressions with counting (RCsores). Our experiments (see Table 3) show that 89.45% of the 425,275 regular expressions extracted from XSD files crawled from the Open Geospatial Consortium (OGC) XML Schema repositoryFootnote 1 are RCsores, i.e., the majority of schemas in these real-world XSD files use RCsores. It is therefore worthwhile to study a learning algorithm for RCsores. Compared with Gold-style learning [15], descriptive generalization [12, 13] does not require learning an exact representation of the target language, but can lead to a compact and powerful model [13]. Thus, our learning algorithm is based on descriptive generalization [12, 13].

For learning SOREs, Bex et al. [7] proposed RWR and RWR\(_{\ell }^2\), and Freydenberger et al. [12] presented the learning algorithm Soa2Sore. Both [7] and [12] mention as future work that SOREs extended with counting could be learnt by an additional post-processing step following RWR (resp. Soa2Sore). However, such post-processing may result in overgeneralization [25]. To solve this problem, Wang et al. [25] proposed the class of ECsores (see Definition 2) and the corresponding learning algorithm InfECsore. Although the ECsore learnt by InfECsore is descriptive of any given finite sample, the recall of InfECsore is relatively lowFootnote 2. Additionally, since every possibly repeated subexpression of an ECsore can be extended with counting, InfECsore must perform a large amount of exact counting, which makes it inefficient on larger samples. Wang et al. [26] also proposed cSOREs, a subclass of ECsores, together with the learning algorithm InfcSORE, but the learnt cSORE is not descriptive of any given finite sampleFootnote 3. Therefore, we propose a new subclass, RCsores, and a corresponding learning method. Although RCsores are also a subclass of ECsores, for any given finite sample our algorithm guarantees both that the learnt RCsore is descriptive of the sample (w.r.t. the class of RCsores) and that its recall is higher than that of the expression learnt by InfECsore. Moreover, for a smaller sample, the learnt RCsore has better generalization ability (higher precision and recall) than the learnt ECsore, and our learning algorithm is more efficient than that of ECsores on larger samples.

The main contributions of this paper are as follows.

  • We infer a SORE and construct an equivalent countable finite automaton (CFA) [25].

  • We run the CFA on the given finite sample to obtain an updated CFA, in which the counting operators that will occur in the RCsore have been updated.

  • We convert the updated CFA to an RCsore and prove that the generated RCsore is descriptive of any given finite language.

The paper is structured as follows. Section 2 gives the basic definitions. Section 3 presents the learning algorithm for RCsores and proves that the RCsore generated by our algorithm is descriptive of any given finite language. Section 4 presents experiments. Section 5 concludes the paper.

2 Preliminaries

2.1 Regular Expression with Counting

Let \(\varSigma \) be a finite alphabet of symbols, and let \(\mathcal {R}_c\) be the (non-empty) set of regular expressions with counting over \(\varSigma \). \(\varepsilon \) and \(a\!\in \!\varSigma \) are regular expressions in \(\mathcal {R}_c\). For regular expressions \(r_1,r_2\!\in \!\mathcal {R}_c\), the disjunction \((r_1|r_2)\), the concatenation \((r_1\cdot r_2)\), the Kleene-star \(r_1^{*}\), and counting (numerical occurrence constraints [14]) \(r_1^{[m,n]}\) are also regular expressions in \(\mathcal {R}_c\), where \(m \!\in \! \mathbb {N}\), \(n \!\in \! \mathbb {N}_{/1}\), \(\mathbb {N}\!=\!\{1,2,3,\cdots \}\), \(\mathbb {N}_{/1}\!=\!\{2,3,4,...\} \!\cup \! \{+\infty \}\), and \(m \!\le \! n\). For a regular expression \(r\in \mathcal {R}_c\), \(\mathcal {L}(r^{[m,n]})\!=\!\{w_1\cdots w_i|w_1,\cdots ,w_i\!\in \! \mathcal {L}(r),m\!\le \! i\!\le \! n\}\). Note that \(r^{+}\), r?, and \(r^{*}\) are used as abbreviations of \(r^{[1,+\infty ]}\), \(r|\varepsilon \), and \(r^{[1,+\infty ]}|\varepsilon \), respectively. Usually, we omit concatenation operators in examples. |r| denotes the length of r, which is the number of symbols and operators occurring in r plus the sizes of the binary representations of the integers [14]. For a finite sample S, |S| denotes the number of strings in S. \(\varnothing \) denotes the empty set. For space reasons, all omitted proofs can be found at http://github.com/GraceFun/InfRCsore.
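For a finite base language, the semantics of the counting operator above can be enumerated directly. The following sketch is our illustration (not part of the paper's algorithms): it realizes \(\mathcal {L}(r^{[m,n]})\) as all concatenations of i words of \(\mathcal {L}(r)\) with \(m\le i\le n\).

```python
from itertools import product

def counted(base, m, n):
    """Enumerate L(r^[m,n]) for a *finite* base language L(r):
    all concatenations w_1...w_i with each w_j in L(r) and m <= i <= n."""
    result = set()
    for i in range(m, n + 1):
        for combo in product(sorted(base), repeat=i):
            result.add("".join(combo))
    return result

# L(a^[2,3]) over L(a) = {"a"} is {"aa", "aaa"}.
print(counted({"a"}, 2, 3))
```

Note that for \(n = +\infty \) the language is infinite, so this enumeration applies only to bounded counters.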

2.2 SORE, ECsore and RCsore

SORE is defined as follows.

Definition 1

(SORE [6, 7]). Let \(\varSigma \) be a finite alphabet. A single-occurrence regular expression (SORE) is a standard regular expression over \(\varSigma \) in which every terminal symbol occurs at most once.

Example 1

\((ab)^+\) is a SORE, while \((ab)^{+}a\) is not.
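The single-occurrence property of Definition 1 is easy to check mechanically. A minimal sketch (our illustration, assuming expressions are given as plain strings whose alphabetic characters are exactly the terminal symbols):

```python
def is_single_occurrence(expr):
    """Check the single-occurrence property of Definition 1: every
    terminal (alphabetic) symbol occurs at most once in the expression.
    Operators and parentheses are ignored; syntactic validity of the
    expression itself is not checked here."""
    terminals = [c for c in expr if c.isalpha()]
    return len(terminals) == len(set(terminals))

print(is_single_occurrence("(ab)+"))   # (ab)+ is a SORE
print(is_single_occurrence("(ab)+a"))  # (ab)+a is not: 'a' occurs twice
```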

Definition 2

(ECsore [25]). Let \(\varSigma \) be a finite alphabet. An ECsore is a regular expression with counting over \(\varSigma \) in which every terminal symbol occurs at most once. For any regular expression r, an ECsore forbids immediately nested counters and expressions of the form (r?)? and \((r?)^{[m,n]}\).

ECsores use neither the Kleene-star nor the iteration operator, and they are deterministic by definition.

Definition 3

(RCsore). Let \(\varSigma \) be a finite alphabet. An RCsore is an ECsore over \(\varSigma \). For regular expressions \(r_1\), \(r_2\) and \(r_3\), an RCsore forbids expressions of the form \((r_1r_2r_3)^{[m_1,n_1]}\) where \(\varepsilon \!\in \!\mathcal {L}(r_1)\), \(\varepsilon \!\in \!\mathcal {L}(r_3)\) and \(r_2\!\in \!\{e^{[m_2,n_2]},e?\}\) for a regular expression e with \(\varepsilon \not \in \mathcal {L}(e)\).

By definition, RCsores are a subclass of ECsores. Since ECsores are deterministic regular expressions, so are RCsores.

Example 2

\((a|b^{[1,2]})^{[3,4]}(c?d)^{[1,+\infty ]}\), \((a^{[3,4]}b)^{[1,2]}\), and \(((a?b?|c)(d^{[2,3]})?)^{[1,2]}\) are RCsores, also ECsores, while \(a?b^+a\) is not a SORE, therefore neither an RCsore nor an ECsore. However, the expressions \((a?b^{[1,2]}c?)^{[1,2]}\) and \((a?b?c?)^{[1,2]}\) are ECsores, not RCsores. \((a^{[1,2]})^{[1,2]}\), \(((a^{[1,2]})?)^{[1,2]}\) and \(((a^{[1,2]})?)?\) are forbidden.

2.3 Descriptivity

We give the notion of descriptive expressions and automata.

Definition 4

(Descriptivity [12]). Let \(\mathcal {D}\) be a class of regular expressions or finite automata over some alphabet \(\varSigma \). A \(\delta \in \mathcal {D}\) is called \(\mathcal {D}\)-descriptive of a non-empty language \(S\subseteq \varSigma ^*\) if \(\mathcal {L} (\delta )\supseteq S\), and there is no \(\gamma \in \mathcal {D}\) such that \(\mathcal {L} (\delta )\supset \mathcal {L} (\gamma )\supseteq S\).

If a class \(\mathcal {D}\) is clear from the context, we simply write descriptive instead of \(\mathcal {D}\)-descriptive.
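Definition 4 can be checked by brute force when the class and its languages are finite. The following toy sketch is an illustration we add (real classes contain infinitely many expressions with infinite languages); each class member is represented directly by its finite language:

```python
def descriptive(cls, sample):
    """Return the members of a toy finite class that are descriptive of
    `sample` per Definition 4: their language covers the sample, and no
    other member has a strictly smaller language that still covers it.
    Class members are given directly as frozensets of strings."""
    covering = [lang for lang in cls if sample <= lang]
    return [d for d in covering
            if not any(g < d for g in covering)]  # no strictly smaller cover

cls = [frozenset({"a", "b"}), frozenset({"a", "b", "c"}), frozenset({"a"})]
sample = {"a", "b"}
# {"a","b","c"} is not descriptive: {"a","b"} is a strictly smaller cover.
print(descriptive(cls, sample))
```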

Proposition 1

Let \(\varSigma \) be a finite alphabet. There exists an RCsore-descriptive RCsore r for every non-empty language \(\mathcal {L} \subseteq \varSigma ^*\).

2.4 Countable Finite Automaton

Definition 5

(Countable Finite Automaton [25]). A Countable Finite Automaton (CFA) is a tuple \((Q,Q_c,\varSigma ,\mathcal {C},q_0,\) \(q_f,\varPhi ,\) \(\mathsf {U},\mathsf {L})\). The members of the tuple are described as follows:

  • \(\varSigma \) is a finite and non-empty alphabet.

  • \(q_0\) and \(q_f:\) \(q_0\) is the initial state, \(q_f\) is the unique final state.

  • Q is a finite set of states. \(Q=\varSigma \cup \{q_0,q_f\} \cup \{+_i\}_{i \in \mathbb {N}}\).

  • \(Q_c\subset Q\) is a finite set of counter states. A counter state is either a state q (\(q\in \varSigma \)) that can transit directly to itself, or a state \(+_i\). With each subexpression (other than a single symbol \(a\in \varSigma \)) under an iteration operator, we associate a unique counter state \(+_i\) to count the minimum and maximum numbers of repetitions of the subexpression.

  • \(\mathcal {C}\) is a finite set of counter variables used for counting the numbers of repetitions of the subexpressions under the iteration operators: \(\mathcal {C}=\{c_{q}|q\in Q_c\}\), i.e., with each counter state q we associate a counter variable \(c_{q}\).

  • \(\mathsf {U}\!=\!\{u(q)|q\in Q_c\}\), \(\mathsf {L}\!=\!\{l(q)|q\in Q_c\}\). For each subexpression under the iteration operator, we associate a unique counter state q such that l(q) and u(q) are the minimum and maximum number of repetitions of the subexpression, respectively.

  • \(\varPhi \) maps each state \(q\!\in \! Q\) to a set of tuples consisting of a state \(p\!\in \! Q\) and two update instructions. \(\varPhi \): \(Q \mapsto \wp (Q \times ((\mathsf {L} \times \mathsf {U} \mapsto (\mathbf {Min}(\mathsf {L} \times \mathcal {C}),\mathbf {Max}(\mathsf {U} \times \mathcal {C})))\cup \{\emptyset \}) \times ((\mathcal {C} \mapsto \{ {\mathbf {res,inc}}\})\cup \{\emptyset \}))\). (\(\emptyset \) denotes empty instruction.)

Definition 6

(Transition Function of a CFA [25]). The transition function \(\delta \) of a CFA \((Q,Q_c,\varSigma \), \(\mathcal {C},q_0,q_f,\varPhi ,\mathsf {U},\mathsf {L})\) is defined for any configuration \((q,\gamma ,\theta )\) and any letter \(y \in \varSigma \cup \{\dashv \}\) as follows:

  1. \(y\!\in \! \varSigma :\) \(\delta ((q,\gamma ,\theta ), y)\) \(=\!\{(z,f_{\alpha }(\gamma , \theta ),g_{\beta }(\theta ))|(z,\alpha ,\beta ) \!\in \! \varPhi (q) \wedge (z=y \vee ((y,\alpha ,\beta )\not \in \varPhi (q) \wedge z\!\in \! \{+_i\}_{i\in \mathbb {N}})) \}\).

  2. \(y= \dashv :\) \(\delta ((q,\gamma ,\theta ), \dashv )\) \(=\{(z,f_{\alpha }(\gamma , \theta ),g_{\beta }(\theta ))|(z,\alpha ,\beta ) \in \varPhi (q) \wedge (z=q_f \vee z\in \{+_i\}_{i\in \mathbb {N}}) \}\).

3 Inference of RCsores

Our learning algorithm works in the following steps.

[Algorithm 1: InfRCsore — pseudocode figure omitted]

(1) We infer a SORE for a given finite sample. (2) A CFA equivalent to the SORE obtained in step (1) is constructed. (3) The CFA from step (2) is run on the same finite sample used in step (1) to obtain an updated CFA, in which the counting operators that will occur in the RCsore have been updated. (4) The updated CFA from step (3) is converted into an RCsore.
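The four steps can be illustrated end-to-end in a deliberately tiny special case: a sample in which every string repeats a single symbol, so the inferred SORE is \(a^+\) and counting its repetitions yields \(a^{[min,max]}\). This toy is our simplification; it elides the CFA machinery entirely and handles only this degenerate case.

```python
def infer_single_symbol(sample):
    """Toy instance of the pipeline for samples where every string
    repeats one symbol (inferred SORE: a^+). Counting the repetitions
    (steps 2-3) and attaching the resulting counter (step 4) gives
    a^[min,max]. The full algorithm handles arbitrary SOREs via a CFA."""
    symbol = sample[0][0]
    assert all(set(w) == {symbol} for w in sample), "toy case only"
    lengths = [len(w) for w in sample]
    return f"{symbol}^[{min(lengths)},{max(lengths)}]"

print(infer_single_symbol(["aa", "aaaa", "aaa"]))  # a^[2,4]
```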

Algorithm 1 is the framework of our learning algorithm. Algorithm SOA [12] constructs the single-occurrence automaton (SOA) [7, 12] for the given finite sample S. Algorithm InfSore is described in Sect. 3.1, algorithm \({ ConsCFA}\) is given in Sect. 3.2, algorithm Counting is given in [25], and algorithm GenRCsore is presented in Sect. 3.4.

3.1 Inferring Standard Deterministic Regular Expression: SORE

The problem of learning SOREs was addressed by Bex et al., who proposed the learning algorithm RWR [7] and its variants. Freydenberger et al. [12] proved that the results of RWR and its variants are not descriptive of any given finite sample, and then presented the learning algorithm Soa2Sore. However, the SORE learnt by Soa2Sore is descriptive of the language accepted by the SOA built for the given finite sample, rather than of the sample itself [12]. Despite this, we can still infer a SORE from which an RCsore that is descriptive of the given finite sample can be derived.

Algorithm 2 learns a SORE from the given finite sample. First, a SORE is inferred by Soa2Sore. Then, the SORE is converted into a normal form (also a SORE). Theorem 1 shows that the normal form approximates the given finite sample more closely than the SORE learnt by Soa2Sore.

[Algorithm 2: InfSore — pseudocode figure omitted]

In Algorithm 2, if the SORE \(r_0\) does not contain any expression of the forms \(r_{f_1}\), \(r_{f_2}\) and \(r_{f_3}\) (specified in lines 4, 2 and 3, respectively), then InfSore directly outputs \(r_0\), i.e., \(r_s=r_0\). Note that, except for case (1) (line 5), all cases are equivalence-preserving conversions of \(r_0\). The conversion in case (2) (line 7) is mainly used to simplify the construction of a CFA in the next section and to track as many (possibly repeated) subexpressions as possible in a SORE. Converting \(r_0\) into the normal form \(r_s\) takes \(\mathcal {O}(|r_0|)\) time. Let the SOA built in line 1 contain \(n_s\) nodes and \(t_s\) transitions. Soa2Sore takes \(\mathcal {O}(n_st_s)\) time to infer a SORE. Thus, the time complexity of algorithm InfSore is \(\mathcal {O}(n_st_s)\) (since \(n_st_s\!>\!|r_0|\)).

Example 3

For sample \(S\!=\!\{a,acc,acbb,bab\}\), the result of algorithm Soa2Sore is \(r_0\!=\!((a(c^+)?)|b)^+\). Let the SORE \(r_s:=\)InfSore(SOA(S)), then the SORE \(r_s\!=\!((a(c^+)?)^+|b^+)^+\).

Theorem 1

For any given finite sample S, let \(r_0=Soa2Sore(SOA(S))\), and let \(r_s:=\)InfSore(SOA(S)), then \(\mathcal {L}(r_0)\supseteq \mathcal {L}(r_s)\supseteq S\).

According to Theorem 1, \(\mathcal {L}(r_s)\) approximates the given finite sample more closely than \(\mathcal {L}(r_0)\). Therefore, we can obtain a descriptive RCsore, extended from an expression of the form \(r_s\).

3.2 Translating SORE to CFA

To avoid a large amount of exact counting in a CFA, the CFA should be constructed from a specific structure, instead of being learnt from a given finite sample [25]. Therefore, in this section, we present how to translate a SORE into a CFA. First, we construct the state-transition diagram of a CFA by traversing the syntax tree of the SORE obtained in Sect. 3.1. The remaining details of the CFA are similar to those described in [25]. Theorem 2 shows that an equivalent CFA can be constructed for the SORE.

Fig. 1. The syntax tree of the expression \(((a(c^+)?)^+|b^+)^+\). [figure omitted]

Algorithm 3 first constructs the state-transition diagram of a CFA by using Algorithm 4, and then fills in the detailed descriptions of the CFA. The state-transition diagram of a CFA is a finite directed graph, denoted by G. Algorithm 4 constructs a directed graph G by traversing a syntax tree; the process is similar to the preorder traversal of a binary tree. For a syntax tree T, T.L and T.R denote the left and right subtrees of T, respectively. For a graph G, \(G.\!\prec \!(v)\) denotes the set of all immediate predecessors of v in G, and \(G.\!\succ \!(v)\) denotes the set of all immediate successors of v in G. Some subroutines in Algorithm 4 are as follows.

[Algorithm 3: ConsCFA — pseudocode figure omitted]

\(Conn_G(t,G_1,G_2)\). According to the label t, a new graph G is constructed by connecting graphs \(G_1\) and \(G_2\). If t = ‘\(\cdot \)’, then add edges \(\{(v_1,v_2)|v_1 \!\in \! G_1.\!\prec \!(q_f), v_2 \!\in \! G_2.\!\succ \!(q_0)\}\); remove nodes \(G_1.q_f\), \(G_2.q_0\) and their associated edges; let \(G.q_0\!=\!G_1.q_0\) and \(G.q_f\!=\!G_2.q_f\). If t = ‘|’, then add new nodes \(q_0\), \(q_f\); add edges \(\{(q_0,v_1)|v_1\) \(\in \! G_1.\!\succ \!(q_0)\!\cup \! G_2.\!\succ \!(q_0)\}\) and \(\{(v_2,q_f)|v_2\!\in \! G_1.\!\prec \!(q_f)\!\cup \! G_2.\!\prec \!(q_f)\}\); remove nodes \(G_1.q_0\), \(G_1.q_f\), \(G_2.q_0\), \(G_2.q_f\) and their associated edges; let \(G.q_0\!=\!q_0\) and \(G.q_f\!=\!q_f\).

[Algorithm 4 — pseudocode figure omitted]

\(Add^+(G,+_i)\). G is a graph, and \(+_i\) (a counter state in the CFA) is a node. \(Add^+\) adds node \(+_i\) (initially, \(i=1\)) into the graph G: add a new node \(q_f\); let \(\mathcal {R}_{+_i}=\{v|v \in G.\!\succ \!(q_0)\}\); add edges \(\{(+_i,v_1)|v_1 \in G.\!\succ \!(q_0)\}\); add edges \(\{(v_2,+_i)|v_2 \!\in \! G.\!\prec \!(q_f)\}\); remove node \(G.q_f\) and its associated edges; add edge \((+_i,q_f)\). The set \(\mathcal {R}_{+_i}\) records the transition entrances through which state \(+_i\) counts the minimum and maximum numbers of repetitions of the subexpression under the iteration operator. Each \(\mathcal {R}_{+_i}\) is a global variable. Let \(\mathcal {R}=\{\mathcal {R}_{+_i}\}_{i\in \mathbb {N}}\).
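The two cases of \(Conn_G\) can be sketched on fragments represented as edge sets with unique initial/final nodes. This is our simplified rendering for illustration only; it omits \(Add^+\), the bookkeeping set \(\mathcal {R}\), and the rest of Algorithm 4.

```python
class Frag:
    """A state-transition fragment with unique initial/final nodes."""
    def __init__(self, sym=None):
        self.q0, self.qf = object(), object()
        self.edges = set()
        if sym is not None:              # single-symbol fragment q0 -> sym -> qf
            self.edges = {(self.q0, sym), (sym, self.qf)}

    def succ(self, v):                   # immediate successors of v
        return {y for (x, y) in self.edges if x == v}

    def pred(self, v):                   # immediate predecessors of v
        return {x for (x, y) in self.edges if y == v}

def conn(t, g1, g2):
    """Conn_G(t, G1, G2): connect two fragments according to label t."""
    g = Frag()
    if t == '.':                         # concatenation: bridge G1.qf / G2.q0
        g.q0, g.qf = g1.q0, g2.qf
        keep = {e for e in g1.edges | g2.edges
                if g1.qf not in e and g2.q0 not in e}
        bridge = {(u, v) for u in g1.pred(g1.qf) for v in g2.succ(g2.q0)}
        g.edges = keep | bridge
    elif t == '|':                       # disjunction: fresh q0/qf around both
        keep = {e for e in g1.edges | g2.edges
                if not ({g1.q0, g1.qf, g2.q0, g2.qf} & set(e))}
        starts = {(g.q0, v) for v in g1.succ(g1.q0) | g2.succ(g2.q0)}
        ends = {(v, g.qf) for v in g1.pred(g1.qf) | g2.pred(g2.qf)}
        g.edges = keep | starts | ends
    return g

ab = conn('.', Frag('a'), Frag('b'))     # fragment for a.b: q0 -> a -> b -> qf
print(ab.succ(ab.q0), ab.succ('a'))
```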

In Algorithm 3, after the state-transition diagram G of a CFA is constructed, the CFA \(\mathcal {A}\) is obtained. Line 2 refers to [25] for the detailed description of the CFA \(\mathcal {A}\). Note that \(\varPhi (\mathcal {R})\) indicates that \(\mathcal {R}\) is a parameter of \(\varPhi \).

For any SORE r obtained in Sect. 3.1, the time complexity of constructing the corresponding syntax tree is \(\mathcal {O}(|r|)\), and the preorder traversal of the syntax tree used to construct the state-transition diagram of a CFA also requires \(\mathcal {O}(|r|)\) time. Therefore, the time complexity of constructing a CFA is \(\mathcal {O}(|r|)\).

Example 4

For the expression \( ((a(c^+)?)^+|b^+)^+\), the syntax tree can be seen in Fig. 1. The corresponding state-transition diagram can be seen in Fig. 2(a).

Theorem 2

For any given SORE r, there is a CFA \(\mathcal {A}\) such that \(\mathcal {L}(\mathcal {A})=\mathcal {L}(r)\).

Fig. 2. (a) is the CFA \(\mathcal {A}\) for the regular language \(\mathcal {L}(((a(c^+)?)^+|b^+)^+)\). The label of a transition edge is \((y;\alpha _i;\beta _j)\) \((i,j\!\in \!\mathbb {N})\), where y (\(y\!\in \! \varSigma \cup \{\dashv \}\)) is the current letter; (b) specifies that \(\alpha _i\) is an update instruction for the lower- and upper-bound variables, and \(\beta _j\) is an update instruction for the counter variable. [figure omitted]

3.3 Counting with CFA

The CFA constructed in Sect. 3.2 runs on the given finite sample, i.e., the same set of strings used to generate the SORE in Sect. 3.1. The CFA counts the minimum and maximum numbers of repetitions of the subexpressions under the iteration operators; the counting rules are given by the transition functions of the CFA. We use the algorithm Counting proposed in [25] to run the CFA. Let \(\mathcal {A}\) denote the constructed CFA and S the given finite sample. After \(\mathcal {A}\) has processed the sample S, let \(\mathcal {A}'\) denote the CFA in which the minimum and maximum numbers of repetitions of the subexpressions under the iteration operators have been updated. Let \(\mathcal {A}'\!=\!Counting(\mathcal {A},S)\), and \(\mathsf {C}=\{(l(q),u(q))|l(q)\!=\!\mathcal {A}'.\mathsf {L}.l(q),u(q)\!=\!\mathcal {A}'.\mathsf {U}.u(q),q\in \mathcal {A}'.Q_c\}\). The elements of \(\mathsf {C}\) are the counting operators that will be introduced into the RCsore. The time complexity of Counting is \(\mathcal {O}(N\overline{L})\), where \(N=|S|\) and \(\overline{L}\) is the average length of the strings in S [25].

Example 5

For the sample \(S\!=\!\{a,acc,acbb,bab\}\), \(r_s\!=\!((a(c^+)?)^+|b^+)^+\) is the SORE obtained in Sect. 3.1, and the CFA \(\mathcal {A}\) shown in Fig. 2 runs on the sample S. The tuples in \(\mathsf {C}\) are then as follows: \((l(c),u(c))\!=\!(1,2)\), \((l(b),u(b))\!=\!(1,1)\)Footnote 4, \((l(+_1),u(+_1))\!=\!(1,1)\), \((l(+_2),u(+_2))\!=\!(1,3)\). \(l(+_1)\) and \(u(+_1)\) (resp. \(l(+_2)\) and \(u(+_2)\)) are the minimum and maximum numbers of repetitions of the subexpression \((a(c^+)?)\) (resp. \((a(c^+)?|b)\)), respectively. Note that the number of repetitions of symbol c is 0 in both strings a and bab. In Sect. 3.4, we will convert the expression \(c^{[1,2]}\) to \((c^{[1,2]})?\).
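For a counter state that is a single self-looping symbol, the bounds l(q) and u(q) amount to the minimum and maximum lengths of that symbol's maximal runs in the sample. The sketch below is our simplification: it reproduces \((l(c),u(c))\!=\!(1,2)\) from Example 5, but counters governed by the expression structure, such as b here (see Footnote 4) or the states \(+_i\), require the full CFA run.

```python
def run_bounds(sample, symbol):
    """Minimum and maximum lengths of the maximal runs of `symbol`
    across all strings in the sample (strings without the symbol are
    ignored). This mimics only the simplest case of the CFA's counting:
    a single self-looping counter state."""
    runs = []
    for w in sample:
        n = 0
        for ch in w + "\0":          # sentinel flushes a trailing run
            if ch == symbol:
                n += 1
            elif n:
                runs.append(n)
                n = 0
    return (min(runs), max(runs)) if runs else None

S = ["a", "acc", "acbb", "bab"]
print(run_bounds(S, "c"))  # (1, 2): matches (l(c), u(c)) in Example 5
```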

3.4 Generating RCsore

In this section, we transform the updated CFA \(\mathcal {A}'\) obtained in Sect. 3.3 into an RCsore. The algorithm GenECsore [25] converts a CFA into a descriptive ECsore (w.r.t. the class of ECsores). We can still use GenECsore to derive an RCsore, because the CFA constructed in this paper is an equivalent representation of an RCsore, not of an ECsore. Hence, for the updated CFA \(\mathcal {A}'\), GenECsore converts \(\mathcal {A}'\) into a descriptive RCsore (w.r.t. the class of RCsores).

[Algorithm 5: GenRCsore — pseudocode figure omitted]

Algorithm 5 converts the updated CFA into an RCsore. Theorem 3 shows that the resulting RCsore is descriptive of any given finite sample. Assume that the updated CFA contains \(n_c\) nodes and \(t_c\) transitions. GenECsore takes \(\mathcal {O}(n_ct_c)\) time to infer an ECsore [25]. Thus, the time complexity of generating an RCsore is \(\mathcal {O}(n_ct_c)\).

Example 6

The tuples in \(\mathsf {C}\) obtained from algorithm Counting are as follows: \((l(c),u(c))\!=\!(1,2)\), \((l(b),u(b))\!=\!(1,1)\), \((l(+_1),u(+_1))\!=\!(1,1)\) and \((l(+_2),u(+_2))\!=\!(1,3)\). For the updated CFA \(\mathcal {A}'\), the generated RCsore is \(((a(c^{[1,2]})?)|b)^{[1,3]}\).
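The counter substitution performed in Example 6 can be mimicked on the syntax tree: attach each bound from \(\mathsf {C}\) to its iteration node and drop [1, 1] counters. This is only our tree-based reconstruction of the example for illustration (GenRCsore itself works on the CFA); ^[m,n] is ASCII shorthand for \(r^{[m,n]}\).

```python
def render(node, counts):
    """Render an RCsore string from a syntax tree annotated with counter
    bounds. Nodes are tuples: ('sym', a), ('cat', l, r), ('alt', l, r),
    ('opt', e), and ('rep', e, k), where k indexes (l, u) in `counts`.
    A [1,1] counter is dropped (the grouping parentheses are kept)."""
    kind = node[0]
    if kind == 'sym':
        return node[1]
    if kind == 'cat':
        return render(node[1], counts) + render(node[2], counts)
    if kind == 'alt':
        return render(node[1], counts) + "|" + render(node[2], counts)
    if kind == 'opt':
        return "(" + render(node[1], counts) + ")?"
    if kind == 'rep':
        lo, hi = counts[node[2]]
        body = render(node[1], counts)
        wrapped = body if node[1][0] == 'sym' else "(" + body + ")"
        if (lo, hi) == (1, 1):
            return wrapped                 # counter [1,1] is dropped
        return wrapped + f"^[{lo},{hi}]"

# Syntax tree of ((a(c^+)?)^+|b^+)^+ with the counts from Example 5.
counts = {'c': (1, 2), 'b': (1, 1), '+1': (1, 1), '+2': (1, 3)}
tree = ('rep',
        ('alt',
         ('rep', ('cat', ('sym', 'a'),
                  ('opt', ('rep', ('sym', 'c'), 'c'))), '+1'),
         ('rep', ('sym', 'b'), 'b')),
        '+2')
print(render(tree, counts))  # ((a(c^[1,2])?)|b)^[1,3], as in Example 6
```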

Theorem 3

For any given finite language S, let \(r:=\)InfRCsore(S). Then the time complexity of algorithm InfRCsore is \(\mathcal {O}(n_ct_c+N\overline{L})\), and r is an RCsore-descriptive RCsore for S.

Let \(\mathcal {A}_c\) and \(\mathcal {A}_g\) denote the CFAs constructed in this paper and in [25], respectively, and assume that \(\mathcal {A}_g\) contains \(n_g\) nodes and \(t_g\) transitions. The time complexity of InfECsore is \(\mathcal {O}(n_gt_g+N\overline{L})\) [25]. \(\mathcal {A}_c\) and \(\mathcal {A}_g\) are equivalent representations of an RCsore and an ECsore, respectively. \(\mathcal {A}_g\) can contain more nodes labeled \(+_i\) (\(i\in \mathbb {N}\)) than \(\mathcal {A}_c\), and likewise more transitions. Thus, \(n_ct_c\le n_gt_g\).

4 Experiments

In this section, we validate our algorithm on real-world XML data and generated XML data. We also provide evaluations of our algorithm in terms of generalization ability and time performance.

4.1 Data and Experiments

Table 3 demonstrates the practicability of RCsores; we then evaluate our algorithm on XML data. We obtained XML documents (dblp-2018-04-01.xml) conforming to a DTD from the DBLP Computer Science Bibliography corpusFootnote 5, from which we extracted the elements inproc(eedings), article, phdth(esis), incolle(ction), and procee(dings). We obtained XML documents conforming to an XSD from the Mondial corpusFootnote 6, from which the elements count(ry), provin(ce) and city were extracted. To validate on diverse XSDs, a number of real-world XSDs listed in Table 2 were found via Google search. However, since no corresponding XML data were available, we randomly generated XML documents using ToXgeneFootnote 7. The samples employed in the experiments are available at http://github.com/GraceFun/InfRCsore.

Table 1 lists the results of the learning algorithms Soa2Sore, InfECsore and InfRCsore on real-world XML data. Note that, based on descriptive generalization, Soa2Sore is the first algorithm used to infer a SORE [12], and InfECsore learns a practical subclass of deterministic regular expressions with counting, the ECsores [25]. For each of the elements inproc(eedings), article and procee(dings), the expression learnt by InfRCsore is more precise than both the corresponding expression in the original DTD and the expression computed by Soa2Sore. Also, the result of InfRCsore is more general than that of InfECsore, so the learnt RCsore covers more XML data satisfying the corresponding original DTD than the learnt ECsore. For phdth(esis) and incolle(ction), the learnt RCsores are identical to the expressions computed by InfECsore. For each of the elements count(ry), provin(ce) and city, the results of InfRCsore and InfECsore coincide, and the learnt RCsore/ECsore is more precise than both the expression generated by Soa2Sore and the corresponding expression in the original XSD.

Table 2 lists a number of expressions extracted from real-world XSDs and the results of Soa2Sore, InfECsore and InfRCsore on the generated XML data. For ep1, the learnt RCsore is identical to the learnt ECsore; both indicate that more symbols or subexpressions can carry numerical occurrence constraints, yet are allowed to occur more often through the nested counters. For ep2, the learnt RCsore is identical to the learnt ECsore, and both are identical to the corresponding original XSD; this implies that original XSDs such as ep2 can be learnt precisely by InfRCsore and InfECsore. For ep3 and ep4, the expressions learnt by InfECsore are forms forbidden for RCsores; although those ECsores are more precise than the corresponding original XSDs (for ep3, even identical to the original XSD), the learnt RCsores are more general than the learnt ECsores. In particular, for ep4, the learnt RCsore covers more XML data satisfying the corresponding original XSD than the learnt ECsore. For ep5, the learnt RCsore has the same nesting depth of counting operators as the learnt ECsore.

Table 1. Results of Soa2Sore, InfECsore and InfRCsore on real-world XML data. The left column gives the element names and the sample sizes used for Soa2Sore, InfECsore and InfRCsore. The right column lists the original DTD/XSD and the results of Soa2Sore, InfECsore and InfRCsore, respectively.
Table 2. Results of Soa2Sore, InfECsore and InfRCsore on generated XML data.
Table 3. Proportions of SOREs, ECsores, and RCsores.
Fig. 3. (a) shows average precision, and (b) average recall, as functions of the sample size for InfECsore and InfRCsore. [figure omitted]

4.2 Performance

Generalization Abilities. Since the results of InfECsore and InfRCsore can have different generalization abilities for the same sample (e.g., ep3 and ep4 in Table 2), we evaluate the two algorithms by computing precision and recall. We say that a learnt expression with higher precision and recall has better generalization ability. The average precision and average recall, reported as functions of sample size, are averaged over 1000 expressions.

We randomly extracted 1000 expressions from XSDs crawled from the OGC XML Schema repositoryFootnote 8. Each of the 1000 expressions contains counters whose upper bounds are less than 100. To learn each extracted expression \(e_0\), we randomly generated corresponding XML data using ToXgene; the samples were extracted from the XML data, with the sample sizes listed in Fig. 3. We define precision (p) and recall (r) as follows. Let the positive sample (\(S_+\)) be the set of strings accepted by \(e_0\), and the negative sample (\(S_-\)) the set of strings not accepted by \(e_0\). Let \(e_1\) be the expression derived by InfECsore or InfRCsore. The true positives (\(S_{tp}\)) are the strings in \(S_+\) accepted by \(e_1\), and the false negatives (\(S_{fn}\)) are the strings in \(S_+\) rejected by \(e_1\). Similarly, the false positives (\(S_{fp}\)) are the strings in \(S_-\) accepted by \(e_1\), and the true negatives (\(S_{tn}\)) are the strings in \(S_-\) rejected by \(e_1\). Then \(p=\frac{|S_{tp}|}{|S_{tp}|+|S_{fp}|}\) and \(r=\frac{|S_{tp}|}{|S_{tp}|+|S_{fn}|}\). Note that, for an RCsore, we can construct an equivalent counter automaton [14], which decides whether the strings in \(S_+\) and \(S_-\) are recognized, yielding \(|S_{tp}|\), \(|S_{fp}|\) and \(|S_{fn}|\).
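Once membership in \(e_1\) is decidable, the metrics above reduce to simple set computations. A small sketch of ours, with a hypothetical acceptor (strings of length at most 2) standing in for \(e_1\):

```python
def precision_recall(pos, neg, accepts):
    """Precision and recall of a learnt expression: `pos`/`neg` are the
    positive/negative samples (strings accepted/rejected by the target
    e_0), and `accepts` is the membership test of the learnt e_1."""
    tp = {w for w in pos if accepts(w)}      # S_tp
    fp = {w for w in neg if accepts(w)}      # S_fp
    fn = pos - tp                            # S_fn
    p = len(tp) / (len(tp) + len(fp))
    r = len(tp) / (len(tp) + len(fn))
    return p, r

pos = {"a", "aa", "aaa"}                     # accepted by some target e_0
neg = {"b", "bb"}                            # rejected by e_0
# Hypothetical learnt e_1: accepts exactly the strings of length <= 2.
p, r = precision_recall(pos, neg, lambda w: len(w) <= 2)
print(p, r)  # p = 2/4 = 0.5, r = 2/3
```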

As the sample size increases, the plots in Fig. 3(a) show that, compared with InfECsore, the precision of the expression learnt by InfRCsore is higher for smaller samples but lower for larger samples. The plots in Fig. 3(b) show that, for any given sample, the recall of the expression learnt by InfRCsore is higher than that of the expression derived by InfECsore. The reason is that the class of RCsores imposes more restrictions than that of ECsores, so for the same sample some subexpressions of the learnt RCsore carry no counting operators; this makes the learnt RCsore general enough to cover more XML data. In summary, InfRCsore has better generalization ability for smaller samples.

Time Performance. Although Theorem 3 implies that InfRCsore can be faster than InfECsore for learning an RCsore, a quantitative analysis of the time performance of the two algorithms is still needed. We therefore evaluate running times for different sample sizes and alphabet sizes. Our experiments were conducted on a ThinkCentre M8600t-D065 with an Intel Core i7-6700 CPU (3.4 GHz) and 8 GB of memory. All code was written in C++.

Table 4(a) shows the average running times in seconds for InfRCsore and InfECsore as a function of sample size; Table 4(b) shows them as a function of alphabet size. We again randomly extracted expressions from XSDs as described above. First, 1000 expressions of alphabet size 15 were chosen; to learn each of them, we randomly generated corresponding XML data using ToXgene and extracted samples of each size listed in Table 4(a). The running times in Table 4(a) are averaged over the 1000 expressions for each sample size. Another 1000 expressions, with the distinct alphabet sizes listed in Table 4(b), were chosen; to learn each of them, we again generated corresponding XML data using ToXgene, with a fixed sample size of 1000. The running times in Table 4(b) are averaged over the 1000 expressions for each alphabet size.

The running times of InfRCsore compared with those of InfECsore are reported in Table 4(a); they show that InfRCsore is more efficient than InfECsore on large samples. However, Table 4(b) shows that the running time of InfRCsore varies widely when the alphabet size exceeds 20. Overall, the time performance of InfRCsore and InfECsore demonstrates that InfRCsore is more efficient for processing large data sets.

Table 4. (a) and (b) are average running times in seconds for InfRCsore and InfECsore as the functions of sample size and alphabet size, respectively.

5 Conclusion

This paper proposed RCsores, a restricted subclass of deterministic regular expressions with counting, together with a corresponding learning algorithm. The main steps are learning a SORE, constructing an equivalent CFA, running the CFA to obtain an updated CFA, and converting the updated CFA into an RCsore. Compared with previous work, for any given finite language, our algorithm learns a descriptive RCsore that achieves higher recall for any sample, has better generalization ability for smaller samples, and is more efficient on larger samples. Future work includes extending SOREs with counting, interleaving, and unordered concatenation, and studying the corresponding practical issues and learning algorithms.