1 Introduction

In grammatical inference [9], a learning algorithm takes a finite sequence of examples (usually strings) as input and outputs a language description (usually a grammar). There are two main types of presentations: (i) a text for a language L is an infinite sequence of strings \(x_1, x_2, \ldots \) from L such that every string of L occurs at least once in the text; (ii) an informant for a language L is an infinite sequence of pairs \((x_1, d_1), (x_2, d_2), \ldots \) in \(\varSigma ^{*} \times \mathbb {B}\) such that every string of \(\varSigma ^{*}\) occurs at least once in the sequence and \(d_i = \text {true} \iff x_i \in L\). Inference algorithms that use type (ii) information are said to learn from positive and negative examples. From Gold's results [7], we know that the class of context-free languages (and even the class of regular languages) cannot be identified from presentation (i), but can be identified using presentation (ii). However, de la Higuera [8] showed that this identification is computationally hard.

In this work, the following informant learning environment is exploited. Suppose that the inferring process is based on the existence of an Oracle, which can be seen as a device that:

  1.

    Knows the language and has to answer correctly.

  2.

    Can answer equivalence queries. They are made by proposing some hypothesis to the Oracle. The hypothesis is a grammar representing the unknown language. The Oracle answers Yes in the positive case. In the negative case, the Oracle has to return the shortest string in the symmetric difference between the target language and the submitted hypothesis.

Then the following procedure can be applied. Start from a small sample S and \(k=1\). The parameter k denotes the number of non-terminal symbols in the target grammar. Run an answer set program (or another exact method). Every time it turns out that there is no solution satisfying all of the constraints, increase k by 1. As long as the Oracle returns a pair (x, d) in response to an equivalence query, add (x, d) to S and run the answer set program (or, respectively, the other exact method) again. Stop after the answer is Yes. Unfortunately, there is no guarantee that the procedure will terminate in a polynomial number of steps, even when the target language is regular [1]. The equivalence checking may be done by random sampling. The positive answer could then be incorrect, but the probability of such an error decreases as the sampling is repeated.
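A minimal sketch of this loop in Python is given below; the helpers solve (run the exact method for a k-variable grammar consistent with S) and oracle_equiv (return None on Yes, otherwise the shortest counter-example with its label) are hypothetical placeholders for the components described above.

```python
# A minimal sketch of the procedure, assuming two hypothetical helpers:
# solve(S, k)      -- exact search for a k-variable CFG consistent with S,
#                     returning None when the constraints are unsatisfiable;
# oracle_equiv(G)  -- equivalence query, returning None on Yes and otherwise
#                     the pair (x, d) for the shortest counter-example x.

def learn_cfg(initial_sample):
    S = set(initial_sample)        # labeled pairs (x, d)
    k = 1                          # number of non-terminal symbols
    while True:
        G = solve(S, k)
        if G is None:              # no k-variable CFG fits the sample
            k += 1
            continue
        answer = oracle_equiv(G)
        if answer is None:         # the Oracle answered Yes
            return G
        S.add(answer)              # extend the sample and search again
```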

A very similar procedure for the induction of context-free grammars was proposed by Imada and Nakamura [11]. However, for the exact search for a k-variable grammar, they used Boolean formulas and applied a SAT solver. We adopted their main Boolean variables, treating them as predicates, and then constructed a new encoding founded on answer set programming. In an alternative approach, we used the general constraints of Gurobi Optimizer instead of ASP.

1.1 Related Work

The work most closely related to our CFG identification is by Imada and Nakamura [11]. They proposed a way to synthesize CFGs from positive and negative samples based on solving a Boolean satisfiability (SAT) problem. They translated the learning problem for a CFG into a SAT instance, which is then solved by a SAT solver. A satisfying assignment found by the solver encodes a minimal set of rules (this can easily be changed to a minimal set of variables) that derives all positive samples and no negative samples.

They used one derivation constraint and two main types of Boolean variables:

  • Derivation variables. A set of derivation variables represents a relation between nonterminal symbols and substrings (in other words, derivation or parse tree) of each (positive or negative) sample w as follows: for any substring x of w and \(p \in V\), the derivation variable \(T^{p}_{x}\) represents that the nonterminal p derives the string x.

  • Rule variables. A set of rule variables represents a rule set as follows: for any \(p, q, r \in V\), \(a \in \varSigma \), a variable \(R^{p}_{qr}\) (or \(R^{p}_{a}\)) determines whether the production rule \(p \rightarrow q \, r\) (or \(p \rightarrow a\)) is a member of the set of rules or not.

The derivation constraint is the set of the following Boolean expressions, one for each string \(a_1\cdots a_n\) (\(n > 1\)) and nonterminal \(p \in V\).

$$T^{p}_{a_1 \cdots a_n} \leftrightarrow \bigvee ^{n-1}_{i=1} \bigvee _{q \in V} \bigvee _{r \in V} \left( R^{p}_{qr} \wedge T^{q}_{a_1 \cdots a_i} \wedge T^{r}_{a_{i+1} \cdots a_n}\right) .$$
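To see the combinatorics involved, the following sketch enumerates the disjuncts on the right-hand side of this constraint for a single word and nonterminal; the indexing scheme is our own illustration of the notation above.

```python
def disjuncts(w, p, V):
    """Yield (rule, left, right) index triples whose conjunction
    R^p_qr & T^q_b & T^r_c is one disjunct of T^p_w, for |w| > 1."""
    n = len(w)
    for i in range(1, n):          # split point: w = w[:i] + w[i:]
        for q in V:
            for r in V:
                yield ((p, q, r), (q, w[:i]), (r, w[i:]))

# For w = "ab" and V = {0, 1} there are 1 * 2 * 2 = 4 disjuncts.
print(list(disjuncts("ab", 0, {0, 1})))
```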

Nakamura et al. have been working on another approach to the incremental learning of CFGs, implemented in the Synapse system [15]. This approach is based on rule generation by analyzing the results of bottom-up parsing of positive samples and searching for rule sets. Their system can also learn similar CFGs, but only from positive samples. Both methods synthesized similar rule sets for each language in their experiments. They reported that the computation time of the SAT-based approach is shorter than that of Synapse for most languages.

1.2 Our Contribution

The purpose of the present proposal is to investigate to what extent the power of an ASP solver makes it possible to tackle the context-free inference problem for large-size instances, and to compare our approach with the original one. To enable future comparisons with other methods, the Python implementation of our winning method is available via GitLab.

The main original scientific contributions are as follows:

  • the formulation of the induction of a k-variable context-free grammar in terms of logical rules with answer set semantics;

  • the formulation of the induction of a k-variable context-free grammar in terms of general constraints;

  • the construction of an informant learning algorithm based on ASP, CSP, and SAT solvers;

  • the conduct of an appropriate statistical test in order to determine the fastest CFG inference method.

This paper is organized into five sections. In Sect. 2, we present necessary definitions and facts originating from formal languages and declarative problem-solving. Section 3 describes our inference algorithms: (a) based on solving an answer set program, and (b) based on solving a constraint satisfaction program, including general constraints such as AND/OR. Section 4 shows the experimental results of our approaches in comparison with the original one. Concluding comments are made in Sect. 5.

2 Preliminaries

We assume the reader is familiar with the basic theory of context-free languages, e.g., from [10], so we introduce only the notations and notions used later in the paper.

2.1 Words and Languages

An alphabet is a finite, non-empty set of symbols. We use the symbol \(\varSigma \) for the alphabet. A word is a finite sequence of symbols chosen from the alphabet. We denote the length of the word w by |w|. The empty word \(\varepsilon \) is the word with zero occurrences of symbols. Let x and y be words. Then xy denotes the catenation of x and y, that is, the word formed by making a copy of x and following it by a copy of y. As usual, \(\varSigma ^{*}\) denotes the set of words over \(\varSigma \). The word w is called a prefix of the word u if there is a word x such that \(u = wx\). We call it a proper prefix if \(x \ne \varepsilon \). The word w is called a suffix of the word u if there is a word x such that \(u = xw\). It is a proper suffix if \(x \ne \varepsilon \). A factor (or subword) is a prefix of a suffix. A set of words, all of which are chosen from some \(\varSigma ^{*}\), where \(\varSigma \) is a particular alphabet, is called a language.

2.2 Context-Free Grammars

A context-free grammar (CFG) is defined by a quadruple \(G = (V, \varSigma , P, v_0)\), where V is an alphabet of variables (sometimes called non-terminal symbols), \(\varSigma \) is an alphabet of terminal symbols such that \(V \cap \varSigma = \emptyset \), P is a finite set of production rules of the form \(A \rightarrow \alpha \) for \(A \in V\) and \(\alpha \in (V \cup \varSigma )^{*}\), and \(v_0\) is a special non-terminal symbol called the start symbol. For simplicity's sake, we write \(A \rightarrow \alpha _1 \;|\; \alpha _2 \;|\; \cdots \;|\; \alpha _k\) instead of \(A \rightarrow \alpha _1,\, A \rightarrow \alpha _2, \ldots , A \rightarrow \alpha _k\). We call a word \(x \in (V \cup \varSigma )^{*}\) a sentential form. Let u, v be two words in \((V \cup \varSigma )^{*}\) and \(A \in V\). Then, we write \(u A v \Rightarrow u x v\) if \(A \rightarrow x\) is a rule in P. That is, we can substitute the word x for the symbol A in a sentential form if \(A \rightarrow x\) is a rule in P. We call this rewriting a derivation. For any two sentential forms x and y, we write \(x \Rightarrow ^{*} y\) if there exists a sequence \(x = x_0, x_1, x_2, \ldots , x_n = y\) of sentential forms such that \(x_i \Rightarrow x_{i+1}\) for all \(i = 0, 1, \ldots , n-1\). The language L(G) generated by G is the set of all words over \(\varSigma \) that are generated by G; that is, \(L(G) = \{x \in \varSigma ^{*} \mid v_0 \Rightarrow ^{*} x\}\). A language is called a context-free language if it is generated by some context-free grammar. Assume that G is the unknown (target) CFG to be identified. An example (a positive word) of G is a word in L(G), and a counter-example (a negative word) of G is a word not in L(G).

A normal form for context-free grammars is a restricted form such that any grammar can be converted into an equivalent grammar in that form. Among all normal forms for context-free grammars, the most useful and best-known one is the Chomsky normal form (CNF). A grammar is said to be in Chomsky normal form if each of its rules is in one of two possible forms:

  (a)

    \(X \rightarrow x,\, x \in \varSigma ,\, X \in V\), or

  (b)

    \(X \rightarrow Y \, Z,\, X, Y, Z \in V\).
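For illustration, the language \(\{a^n b^n \mid n \ge 1\}\) is generated by the CNF grammar with start symbol S and rules

$$S \rightarrow A\,B \;|\; A\,C, \qquad C \rightarrow S\,B, \qquad A \rightarrow a, \qquad B \rightarrow b.$$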

2.3 Answer Set Programming

We will briefly introduce the idea of answer set programming (ASP). Those who are interested in a more detailed description of the topic, alternative definitions, and the formal specification of this kind of logic programming are referred to handbooks [3, 6], and [12].

A variable or constant is a term. An atom is \(a(t_1, \ldots , t_n)\), where a is a predicate of arity n and \(t_1, \ldots , t_n\) are terms. A literal is either a positive literal p or a negative literal \(\lnot p\), where p is an atom.

A rule r is a clause of the form

$$\begin{aligned} a_0 \leftarrow a_1 \wedge \cdots \wedge a_k \wedge \lnot a_{k+1} \wedge \cdots \wedge \lnot a_m \quad m \ge 0, \end{aligned}$$
(1)

where \(a_0, \ldots , a_m\) are atoms. The atom \(a_0\) is the head of r, while the conjunction \(a_1 \wedge \cdots \wedge a_k \wedge \lnot a_{k+1} \wedge \cdots \wedge \lnot a_m\) is the body of r. By \(H(r)\) we denote the head atom, and by \(B(r)\) the set \(\{a_1, \ldots , a_k, \lnot a_{k+1}, \ldots , \lnot a_m\}\) of body literals. \(B^{+}(r)\) (resp. \(B^{-}(r)\)) denotes the set of atoms occurring positively (resp. negatively) in \(B(r)\). A program (also called an ASP program) is a finite set of rules. A \(\lnot \)-free program is called positive. A term, atom, literal, rule, or program is ground if no variables appear in it.

Let \(\mathcal {P}\) be a program and let r be a rule in \(\mathcal {P}\). A ground instance of r is a rule obtained from r by replacing every variable X in r with constants occurring in \(\mathcal {P}\). We denote the set of all ground instances of the rules occurring in \(\mathcal {P}\) by \(\text {ground}(\mathcal {P})\).

An interpretation I for \(\mathcal {P}\) is a set of ground atoms. A ground positive literal A is true (false, resp.) w.r.t. I if \(A \in I\) (\(A \not \in I\), resp.). A ground negative literal \(\lnot A\) is true (false, resp.) w.r.t. I if \(A \not \in I\) (\(A \in I\), resp.).

Let r be a ground rule in \(\text {ground}(\mathcal {P})\). The head of r is true w.r.t. I if \(H(r) \in I\). The body of r is true w.r.t. I if all body literals of r are true w.r.t. I (i.e., \(B^{+}(r) \subseteq I\text { and }B^{-}(r) \cap I = \emptyset \)), and it is false w.r.t. I otherwise. The rule r is satisfied (or true) w.r.t. I if its head is true w.r.t. I or its body is false w.r.t. I.

A model for \(\mathcal {P}\) is an interpretation M for \(\mathcal {P}\) such that every rule \(r \in \text {ground}(\mathcal {P})\) is true w.r.t. M.

Given a program \(\mathcal {P}\) and an interpretation I, the reduct \(\mathcal {P}^I\) is the set of positive rules defined as follows:

$$\begin{aligned} \mathcal {P}^I = \{H(r) \leftarrow \bigwedge B^{+}(r) \mid r \in \text {ground}(\mathcal {P})\text { and }B^{-}(r) \cap I = \emptyset \}. \end{aligned}$$
(2)

I is an answer set of \(\mathcal {P}\) if I is the \(\subseteq \)-smallest model for \(\mathcal {P}^I\).

Over recent years, answer set programming has emerged as a declarative problem-solving paradigm. It is a programming methodology rooted in research on artificial intelligence and computational logic, and it is used in many areas of science and technology. For the experiments we took advantage of clingo, one of the most efficient and most widely used answer set programming systems available today. In addition to standard definitions, clingo allows one to define constraints, i.e., rules with an empty head, for instance

$$\begin{aligned} \leftarrow a(t) \end{aligned}$$
(3)

By adding this constraint to a program, we eliminate its answer sets that contain \(a(t)\). Adding the ‘opposite’ constraint

$$\begin{aligned} \leftarrow \lnot a(t) \end{aligned}$$
(4)

eliminates those answers that do not contain \(a(t)\). A constraint can be translated into a normal rule. To this end, the constraint

$$\begin{aligned} \leftarrow a_1 \wedge \cdots \wedge a_k \wedge \lnot a_{k+1} \wedge \cdots \wedge \lnot a_m \end{aligned}$$
(5)

is mapped onto the rule

$$\begin{aligned} x \leftarrow a_1 \wedge \cdots \wedge a_k \wedge \lnot a_{k+1} \wedge \cdots \wedge \lnot a_m \wedge \lnot x \end{aligned}$$
(6)

where x is a new atom.

Example. Suppose we have three numbered urns and two distinguishable balls. Each ball has been put into an urn, possibly the same one. An ASP program encoding this knowledge is as follows:

$$\begin{aligned}&\text {urn}(1) \leftarrow \end{aligned}$$
(7)
$$\begin{aligned}&\text {urn}(2) \leftarrow \end{aligned}$$
(8)
$$\begin{aligned}&\text {urn}(3) \leftarrow \end{aligned}$$
(9)
$$\begin{aligned}&\text {ball}(q) \leftarrow \end{aligned}$$
(10)
$$\begin{aligned}&\text {ball}(r) \leftarrow \end{aligned}$$
(11)
$$\begin{aligned}&\text {contains}(U, B) \leftarrow \text {urn}(U) \wedge \text {ball}(B) \wedge \lnot \text {not}\_\text {in}(U, B) \end{aligned}$$
(12)
$$\begin{aligned}&\text {not}\_\text {in}(U, B) \leftarrow \text {urn}(U) \wedge \text {urn}(V) \wedge U \ne V \wedge \text {ball}(B) \wedge \text {contains}(V, B) \end{aligned}$$
(13)
$$\begin{aligned}&\text {in}(B) \leftarrow \text {urn}(U) \wedge \text {ball}(B) \wedge \text {contains}(U, B) \end{aligned}$$
(14)
$$\begin{aligned}&\leftarrow \text {ball}(B) \wedge \lnot \text {in}(B) \end{aligned}$$
(15)

Notice that, as usual in logic programming, identifiers with initial uppercase letters denote variables. Rules 7–11 are simple facts concerning urns and balls. Rules 12 and 13 define predicates that tell whether a ball is inside a particular urn. The inequality \(U \ne V\) is only used during grounding to eliminate some ground instances of rule 13. It is worth mentioning that grounding systems do not make unnecessary replacements, for example, 1 for U. Rules 14 and 15 ensure that every ball is in exactly one urn.

Suppose now that we have discovered that urn 2 is empty and we want to know possible configurations. It is enough to add two facts:

$$\begin{aligned}&\text {not}\_\text {in}(2, q) \leftarrow \end{aligned}$$
(16)
$$\begin{aligned}&\text {not}\_\text {in}(2, r) \leftarrow \end{aligned}$$
(17)

and find all answer sets. A possible answer set is: ball(q), ball(r), urn(1), urn(2), in(r), not_in(2, q), not_in(2, r), not_in(3, q), not_in(3, r), contains(1, q), in(q), urn(3), contains(1, r), which describes the placement of both balls into the first urn.
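For concreteness, the urn program can also be written in clingo's ASCII syntax (where ':-' stands for \(\leftarrow\) and 'not' for \(\lnot\)) and solved through clingo's Python API; the API calls below (Control, add, ground, solve) are the standard ones, while the listing itself is our own illustration.

```python
import clingo

# The urn example, rules (7)-(15) plus the facts (16)-(17).
program = """
urn(1). urn(2). urn(3).
ball(q). ball(r).
contains(U,B) :- urn(U), ball(B), not not_in(U,B).
not_in(U,B)   :- urn(U), urn(V), U != V, ball(B), contains(V,B).
in(B)         :- urn(U), ball(B), contains(U,B).
:- ball(B), not in(B).
not_in(2,q). not_in(2,r).
"""

ctl = clingo.Control(["0"])     # "0" asks for all answer sets
ctl.add("base", [], program)
ctl.ground([("base", [])])
ctl.solve(on_model=print)       # prints one line per answer set
```

Running this enumerates the answer sets, one per admissible placement of the two balls into urns 1 and 3.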

clingo also allows using choice constructions, for instance:

$$\begin{aligned}&\{\text {p}(U, B){:}\,\text {urn}(U)\} = 2 \leftarrow \text {ball}(B) \end{aligned}$$
(18)

describes all possible ways to choose which two of the atoms p(1, q), p(2, q), p(3, q), and which two of the atoms p(1, r), p(2, r), p(3, r), are included in the resultant model. Before and after the expression in braces, we can put integers that bound the number of atoms from the braced set that may be included in a stable model. The number on the left is the lower bound (default 0), and the number on the right is the upper bound (unbounded by default).

3 Proposed Encodings for the Induction of CFGs

Our translation converts CFG identification into an ASP program (the main approach) and into a CSP (constraint satisfaction problem) model (an alternative approach). Suppose we are given a sample composed of examples, \(S_+\), and counter-examples, \(S_-\), over an alphabet \(\varSigma \), and a positive integer k. We want to find a k-variable CFG \(G = (V, \varSigma , P, v_0)\) such that \(S_+ \subseteq L(G)\) and \(S_- \cap L(G) = \emptyset \).

3.1 Using Logic Programming with Answer Set Semantics

Let F be the set of all factors (excluding the empty word) of the words in \(S_+ \cup S_-\). Let us now see how to describe the rules for the relationship between a grammar G and a sample \(S_+ \cup S_-\) in terms of ASP. There are three main predicates: y(I, J, L), which indicates the presence of \(I \rightarrow J \, L\) in P; w(I, Q), which indicates that \(I \Rightarrow ^{*} Q\), where Q represents a factor; and z(I, A), which indicates the presence of \(I \rightarrow A\) in P. A sketch showing how the facts of this encoding can be emitted is given after the list below.

  1.

    We have the following domain specification, our facts.

    $$\begin{aligned}&\text {variable}(i) \leftarrow&\quad&\text {for }i = 0, 1, \ldots , k-1 \end{aligned}$$
    (19)
    $$\begin{aligned}&\text {factor}(f) \leftarrow&\quad&\text {for all }f \in F \end{aligned}$$
    (20)
    $$\begin{aligned}&\text {terminal}(a) \leftarrow&\quad&\text {for all }a \in \varSigma \end{aligned}$$
    (21)
    $$\begin{aligned}&\text {positive}(s) \leftarrow&\quad&\text {for all }s \in S_+ \end{aligned}$$
    (22)
    $$\begin{aligned}&\text {negative}(s) \leftarrow&\quad&\text {for all }s \in S_- \end{aligned}$$
    (23)
    $$\begin{aligned}&\text {compose}(f, b, c) \leftarrow&\quad&\text {for such }f, b, c \in F\text { that }f = bc \end{aligned}$$
    (24)
  2.

    The next rules state that, in a grammar G, a factor may or may not be derivable from a specific variable, and that the grammar contains some subset of all possible productions; rules 28 and 29 make the choices for w and z agree on single terminals.

    $$\begin{aligned}&\{ \text {w}(I, F) \} \leftarrow \text {variable}(I) \wedge \text {factor}(F) \end{aligned}$$
    (25)
    $$\begin{aligned}&\{ \text {z}(I, A) \} \leftarrow \text {variable}(I) \wedge \text {terminal}(A) \end{aligned}$$
    (26)
    $$\begin{aligned}&\{ \text {y}(I, J, L) \} \leftarrow \text {variable}(I) \wedge \text {variable}(J) \wedge \text {variable}(L) \end{aligned}$$
    (27)
    $$\begin{aligned}&\text {w}(I, A) \leftarrow \text {variable}(I) \wedge \text {terminal}(A) \wedge \text {z}(I, A) \end{aligned}$$
    (28)
    $$\begin{aligned}&\text {z}(I, A) \leftarrow \text {variable}(I) \wedge \text {terminal}(A) \wedge \text {w}(I, A) \end{aligned}$$
    (29)
  3.

    All examples should be accepted, and no counter-example can be accepted.

    $$\begin{aligned}&\leftarrow \text {positive}(F) \wedge \lnot \text {w}(0, F) \end{aligned}$$
    (30)
    $$\begin{aligned}&\leftarrow \text {negative}(F) \wedge \text {w}(0, F) \end{aligned}$$
    (31)
  4.

    For every \(f \in F\) with \(|f| \ge 2\) and for every pair \((b, c)\) of factors such that \(bc = f\), the factor f can be derived from a non-terminal I if there are two non-terminals, J and L, such that b can be derived from J, c can be derived from L, and there is a production \(I \rightarrow J \, L\) in P.

    $$\begin{aligned}&\text {w}(I, F) \leftarrow \text {variable}(I) \wedge \text {variable}(J) \wedge \text {variable}(L) \nonumber \\&\qquad \qquad \qquad \wedge \, \text {compose(F, B, C)} \wedge \text {y}(I, J, L) \wedge \text {w}(J, B) \wedge \text {w}(L, C) \end{aligned}$$
    (32)
  5.

    On the other hand, if \(I \Rightarrow ^{*} f\) for a factor f that is not a single terminal, then at least one pair \((J, L)\) should exist such that \(I \rightarrow J \, L\) is in P, \(J \Rightarrow ^{*} b\), and \(L \Rightarrow ^{*} c\) for some split \(f = bc\).

    $$\begin{aligned}&\leftarrow \text {variable}(I) \wedge \text {factor}(F) \wedge \lnot \text {terminal}(F) \wedge \text {w}(I, F) \nonumber \\&\qquad \qquad \wedge \, \{ \text {y}(I, J, L) :\text {variable}(J) \wedge \text {variable}(L) \nonumber \\&\qquad \qquad \qquad \qquad \,\, \wedge \, \text {compose}(F, B, C) \wedge \text {w}(J, B) \wedge \text {w}(L, C) \} = 0 \end{aligned}$$
    (33)
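To make the encoding concrete, the following sketch shows one possible way to emit the facts (19)–(24) in clingo syntax from Python, assuming words are given as strings over \(\{a, b\}\) and both factors and terminals are encoded as quoted string constants so that they match; the names used here are our own.

```python
# A sketch of a generator for the facts (19)-(24); the static rules
# (25)-(33) would be appended verbatim in clingo's ASCII syntax.

def factors(words):
    """All non-empty factors (subwords) of a set of words."""
    return {w[i:j] for w in words for i in range(len(w))
            for j in range(i + 1, len(w) + 1)}

def emit_facts(S_pos, S_neg, k, sigma=("a", "b")):
    F = factors(S_pos | S_neg)
    lines = [f"variable({i})." for i in range(k)]
    lines += [f'factor("{f}").' for f in F]
    lines += [f'terminal("{a}").' for a in sigma]
    lines += [f'positive("{s}").' for s in S_pos]
    lines += [f'negative("{s}").' for s in S_neg]
    # compose(f, b, c) for every split f = b + c of a factor f.
    lines += [f'compose("{f}","{f[:i]}","{f[i:]}").'
              for f in F for i in range(1, len(f))]
    return "\n".join(lines)

print(emit_facts({"ab", "aabb"}, {"a", "ba"}, k=2))
```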

3.2 Using General Constraints

This time, instead of predicates, w, y, and z are binary variables. We use the following constraints

$$\begin{aligned}&w_{0s} = 1&\qquad&\text {for all }s \in S_+ \end{aligned}$$
(34)
$$\begin{aligned}&w_{0s} = 0&\qquad&\text {for all }s \in S_- \end{aligned}$$
(35)

and

$$\begin{aligned} w_{if} \;\; \leftrightarrow \sum _{j, l \in K,\, bc = f} y_{ijl} \wedge w_{jb} \wedge w_{lc} \; + \; (z_{if} \text { if } f \in \varSigma ) \end{aligned}$$
(36)

for each \((i, f) \in K \times F\), where \(\alpha \leftrightarrow \beta \) means if \(\alpha = 0\) then \(\beta = 0\) and if \(\alpha = 1\) then \(\beta \ge 1\), and \(K = \{0, 1, \ldots , k-1\}\).
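The biconditional (36) maps naturally onto Gurobi's AND/OR general constraints (addGenConstrAnd and addGenConstrOr in the gurobipy API). The sketch below assumes the binary variables w, y, and z have already been added to the model and indexed as in the text; it illustrates the idea rather than reproducing our exact implementation.

```python
from gurobipy import GRB

def link_w(model, w, y, z, i, f, K, sigma):
    """Enforce w[i, f] <-> OR over splits f = b + c of (y AND w AND w),
    plus z[i, f] when f is a single terminal. Sketch; variables assumed."""
    terms = []
    for j in K:
        for l in K:
            for s in range(1, len(f)):        # every split f = b + c
                b, c = f[:s], f[s:]
                t = model.addVar(vtype=GRB.BINARY)
                model.addGenConstrAnd(t, [y[i, j, l], w[j, b], w[l, c]])
                terms.append(t)
    if f in sigma:                            # single-letter factor
        terms.append(z[i, f])
    if terms:
        model.addGenConstrOr(w[i, f], terms)
    else:                                     # nothing can derive f
        model.addConstr(w[i, f] == 0)
```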

4 Experimental Results

In this section, we describe experiments comparing the performance of our approaches implemented in Python, using clingo (ASP) and Gurobi Optimizer, with our implementation of Imada and Nakamura's algorithm [11] using the PicoSAT solver (SAT), when positive and negative words are given. For these experiments, we use a set of 40 samples: partly based on randomly generated grammars (33 samples) and partly based on a set of fundamental CFGs appearing in grammatical inference research (the last 7 samples).

4.1 Benchmarks

For testing the learning power on general CFGs, we randomly generated 33 CFGs and, for each of them, prepared positive and negative samples by exhaustively enumerating all words of length at most 14. The grammars are in Chomsky normal form with 6 to 12 rules over the alphabet \(\{a, b\}\). In every sample, positive words constitute no less than 20% of the total.

The last seven samples were likewise obtained by exhaustive enumeration of words of length at most 14, but they were generated from the following language descriptions:

  (a)

    The set of palindromes over \(\{a, b\}\).

  (b)

    The parentheses language: the set of strings consisting of equal numbers of a’s and b’s such that every prefix does not have more b’s than a’s.

  (c)

    The set of strings consisting of b’s twice as many as a’s.

  (d)

    The set of strings of a’s and b’s not of the form ww.

  (e)

    The complement of the language (b).

  (f)

    \(\{a^n b^n \mid n \ge 1\}\).

  (g)

    The set of strings consisting of equal numbers of a’s and b’s.

4.2 Performance Comparison

In all experiments, we used an Intel Xeon E5-2650 v2 CPU at 2.6 GHz (a single core out of eight) under the Ubuntu 18.04 operating system with 60 GB of available RAM. Algorithm 1 shows the process of synthesizing a grammar (a set of production rules, with \(v_0\) always being the start symbol) from positive and negative words.

Algorithm 1. Synthesizing a CFG from positive and negative words (pseudocode not reproduced here)

In the algorithm, \(S_+\) and \(S_-\) represent the sets of positive and negative words given as input. Two auxiliary variables hold the sets of samples to be covered in the next loop iteration. The algorithm picks a word from \(S_+\) or \(S_-\) that is not yet covered correctly by the inferred grammar G and adds it to the respective auxiliary set. The function Convert translates the problem into a set of ASP rules R (or Gurobi general constraints, or a Boolean expression). If the ASP solver (or Gurobi Optimizer, or the SAT solver) finds a stable model M, the function Extract returns a set of production rules by analyzing the presence of particular y(i, j, l) and z(i, a) atoms. The algorithm repeats this process—increasing k to relax the limit on the number of non-terminals—until G is consistent with all of the given \(S_+\) and \(S_-\).

Table 1. Execution times of exact solving CFG identification in seconds
Table 2. Obtained p-values from the Wilcoxon signed-rank test

The results are listed in Table 1. In order to determine whether the observed CPU-time differences between ASP's runs and the remaining methods' runs did not occur by chance, we used the Wilcoxon signed-rank test [17, pp. 915–916] for ASP vs. SAT and ASP vs. Gurobi. The null hypothesis to be tested is that the median of the paired differences is negative (against the alternative that it is positive). As we can see from Table 2, the p-value is high in both cases, so the null hypothesis cannot be rejected, and we may conclude that using our ASP encoding is likely to improve CPU-time performance on most benchmarks of this kind.
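Such a one-sided paired test is available in SciPy; the sketch below illustrates the call with hypothetical timing vectors, not our measurements.

```python
from scipy.stats import wilcoxon

asp = [1.2, 0.8, 3.1, 2.0]   # hypothetical per-sample CPU times (seconds)
sat = [1.5, 1.1, 2.9, 3.4]

# One-sided test on the paired differences sat - asp; a small p-value
# would indicate that the SAT times tend to exceed the ASP times.
stat, p = wilcoxon(sat, asp, alternative="greater")
print(stat, p)
```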

4.3 ASP-Based CFG Induction on Bioinformatics Datasets

Our induction method can also be applied to other data that are not taken from infinite context-free languages. We tried its classification quality on two bioinformatics datasets: the WALTZ-DB database [4], composed of 116 hexapeptides known to induce amyloidosis (\(S_+\)) and 161 hexapeptides that do not induce amyloidosis (\(S_-\)), and the database of Maurer-Stroh et al. [14] from the same domain, where the ratio \(S_+ / S_-\) is 240/836.

We chose a few standard machine learning methods for comparison: BNB (Naive Bayes classifier for multivariate Bernoulli models [13, pp. 234–265]), DTC (Decision Tree Classifier, the CART method [5]), MLP (Multi-layer Perceptron [16]), and SVM (Support Vector Machine classifier with a linear kernel [18]). In all methods except ASP and BNB, an unsupervised data-driven distributed representation called ProtVec [2] was applied in order to convert words (protein representations) into numerical vectors. For BNB, we represented words as binary-valued feature vectors indicating the presence or absence of every pair of protein letters. In the case of ASP, the training set was partitioned randomly into n parts, and the following process was performed m times: choose one part for synthesizing a CFG and use the remaining \(n-1\) parts for validating it. The best of all m grammars—in terms of the highest F-measure—was then confronted with the test set. For WALTZ-DB, both n and m were set to 20; for Maurer-Stroh, n was set to 10 and m to 30. These values were selected experimentally based on the sizes of the databases and the running time of the ASP solver.

To estimate the ability of ASP and the compared approaches to classify unseen hexapeptides, a repeated 10-fold cross-validation (CV) strategy was used: the data are split randomly into 10 mutually exclusive folds, and a model is built on all but one fold and evaluated on the skipped fold. The procedure was repeated 10 times, and the overall assessment of a model was based on the mean of those individual evaluations. Table 3 summarizes the performance of the compared methods on the WALTZ-DB and Maurer-Stroh databases. It is noticeable that the ASP approach achieved the best F-score on the Maurer-Stroh dataset and an average F-score on WALTZ-DB; hence it can be used with high reliability to recognize amyloid proteins. BNB is outstanding on WALTZ-DB and almost as good as ASP on the Maurer-Stroh database.

Table 3. Performance of compared methods on WALTZ-DB and Maurer-Stroh databases in terms of Precision (P), Recall (R), and F-score (F1)
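The protocol used for the baseline classifiers can be sketched with scikit-learn as follows; X and y are hypothetical placeholders for the vectorized hexapeptides and their amyloidosis labels.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Placeholder data: 277 hexapeptides encoded as binary feature vectors.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(277, 400))
y = rng.integers(0, 2, size=277)

# 10 mutually exclusive folds, the whole split repeated 10 times.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(BernoulliNB(), X, y, scoring="f1", cv=cv)
print(scores.mean())
```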

5 Conclusion

In this paper, we proposed an approach for learning context-free grammars from positive and negative samples by using logic programming. We encode the set of samples, together with a limit on the number of non-terminals to be synthesized, as an answer set program. A stable model (an answer set) of the program contains a set of grammar rules that derives all positive samples and no negative samples. One feature of this approach is that we can synthesize a compact set of rules in Chomsky normal form. Another is that our learning method will benefit from future improvements in ASP solvers. We presented experimental results on learning CFGs for fundamental context-free languages, including the set of strings composed of equal numbers of a's and b's and the set of strings over \(\{a, b\}\) not of the form ww. Another series of experiments, on random languages, showed that our encoding can speed up computations in comparison with SAT and CSP encodings.