1 Introduction

Automata have been widely studied and utilized for pattern and string matching problems. A string automaton reads the symbols of an input string one at a time, after which it accepts or rejects the string. But in certain instances, the order in which the symbols appear is irrelevant.

For example, in a graph, the edges incident to a node are unordered, so their labels form a commutative language. Or, in natural language processing, a sentence might be generated by a context-free grammar subject to (hard or soft) order-independent constraints. For example, in summarization, there might be an unordered set of facts that must all be included; or, there might be a constraint that among the references to a particular entity, exactly one is a full NP.

To handle these scenarios, we are interested in weighted automata and weighted regular expressions for multisets. This paper makes three main contributions:

  • We define a new translation from weighted multiset regular expressions to weighted multiset automata, more direct than that of Chiang et al. [3] and more compact (but less general) than that of Droste and Gastin [4].

  • We discuss how to train weighted multiset automata and regular expressions from data.

  • We give a new composable representation of partial runs of weighted multiset automata that is more efficient than that of Chiang et al. [3].

2 Definitions

We begin by defining weighted multiset automata (Sect. 2.2) and recalling the definition of weighted multiset regular expressions from previous work (Sect. 2.3).

2.1 Preliminaries

For any natural number n, let \([n] = \{1, \ldots , n\}\).

A multiset over a finite alphabet \(\varSigma \) is a mapping from \(\varSigma \) to \(\mathbb {N}_0\). For consistency with standard notation for strings, we write a (where \(a \in \varSigma \)) instead of \(\{a\}\), uv for the multiset union of multisets u and v, and \(\epsilon \) for the empty multiset.

The Kronecker product of an \(m \times n\) matrix A and a \(p \times q\) matrix B is the \(mp \times nq\) matrix

$$\begin{aligned} A \otimes B = \left[ \begin{array}{ccc} A_{11} B &{} \cdots &{} A_{1n} B \\ \vdots &{} \ddots &{}\vdots \\ A_{m1} B &{} \cdots &{} A_{mn} B \\ \end{array} \right] . \end{aligned}$$
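As a quick sanity check, numpy's `np.kron` implements exactly this block structure (the example values are ours):

```python
import numpy as np

# Kronecker product of a 2x2 matrix A and a 2x2 matrix B: the result is
# a 4x4 matrix whose (i,j) block is A[i,j] * B, matching the display above.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 5],
              [6, 7]])

K = np.kron(A, B)                         # shape (mp) x (nq) = (4, 4)

assert K.shape == (4, 4)
assert (K[:2, :2] == A[0, 0] * B).all()   # top-left block is A11 * B
assert (K[2:, 2:] == A[1, 1] * B).all()   # bottom-right block is A22 * B
```

The same call is what one would use to implement the product and shuffle constructions of Sect. 3.2.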

If w is a string over \(\varSigma \), we write \({{\mathrm{alph}}}(w)\) for the subset of symbols actually used in w; similarly for \({{\mathrm{alph}}}(L)\) where L is a language. If \(|{{\mathrm{alph}}}(L)|=1\), we say that L is unary.

2.2 Weighted Multiset Automata

We formulate weighted automata in terms of matrices as follows. Let \(\mathbb {K}\) be a commutative semiring.

Definition 1

A \(\mathbb {K}\)-weighted finite automaton (WFA) over \(\varSigma \) is a tuple \(M=(Q, \varSigma , \lambda , \mu , \rho )\), where \(Q = [d]\) is a finite set of states, \(\varSigma \) is a finite alphabet, \(\lambda \in \mathbb {K}^{1 \times d}\) is a row vector of initial weights, \(\mu : \varSigma \rightarrow \mathbb {K}^{d \times d}\) assigns a transition matrix to every symbol, and \(\rho \in \mathbb {K}^{d \times 1}\) is a column vector of final weights.

For brevity, we extend \(\mu \) to strings: if \(w \in \varSigma ^*\), then \(\mu (w) = \mu (w_1) \cdots \mu (w_n)\). Then the total weight of all accepting paths for w is \(M(w) = \lambda \, \mu (w) \, \rho \). In this paper we do not consider \(\epsilon \)-transitions. One unusual feature of our definition is that it allows a WFA to have more than one initial state.

Definition 2

A \(\mathbb {K}\)-weighted multiset finite automaton is one whose transition matrices commute pairwise. That is, for all \(a, b \in \varSigma \), we have \(\mu (a)\mu (b) = \mu (b)\mu (a)\).
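To illustrate the definition, here is a minimal numpy sketch (toy weights, names ours) of a multiset automaton over \(\{a, b\}\); diagonal transition matrices always commute, so the weight assigned to a multiset does not depend on the order in which its symbols are read:

```python
import numpy as np

# A tiny multiset automaton over {a, b} with toy weights (names ours).
# Diagonal transition matrices always commute, so the weight assigned to
# a multiset is independent of the order in which its symbols are read.
lam = np.array([[1.0, 0.0]])           # initial weights, a row vector
mu = {'a': np.diag([0.5, 2.0]),
      'b': np.diag([3.0, 0.25])}
rho = np.array([[1.0], [1.0]])         # final weights, a column vector

def weight(word):
    m = np.eye(2)
    for sym in word:
        m = m @ mu[sym]
    return float(lam @ m @ rho)        # M(w) = lam mu(w) rho

assert np.allclose(mu['a'] @ mu['b'], mu['b'] @ mu['a'])
assert weight('ab') == weight('ba')
```

Any pairwise-commuting matrices over a commutative semiring work here; diagonal matrices are just the simplest example.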

2.3 Weighted Multiset Regular Expressions

This definition follows that of Chiang et al. [3], which in turn is a special case of that of Droste and Gastin [4].

Definition 3

A \(\mathbb {K}\)-weighted multiset regular expression over \(\varSigma \) is an expression belonging to the smallest set \(\mathcal {R}(\varSigma )\) satisfying:

  • If \(a \in \varSigma \), then \(a \in \mathcal {R}(\varSigma )\).

  • \(\epsilon \in \mathcal {R}(\varSigma )\).

  • \(\emptyset \in \mathcal {R}(\varSigma )\).

  • If \(\alpha , \beta \in \mathcal {R}(\varSigma )\), then \(\alpha \cup \beta \in \mathcal {R}(\varSigma )\).

  • If \(\alpha , \beta \in \mathcal {R}(\varSigma )\), then \(\alpha \beta \in \mathcal {R}(\varSigma )\).

  • If \(\alpha \in \mathcal {R}(\varSigma )\), then \(\alpha ^*\in \mathcal {R}(\varSigma )\).

  • If \(\alpha \in \mathcal {R}(\varSigma )\) and \(k \in \mathbb {K}\), then \(k\alpha \in \mathcal {R}(\varSigma )\).

We define the language described by a regular expression, \(\mathcal {L}(\alpha )\), by analogy with string regular expressions. Note that \(\epsilon \) matches the empty multiset, while \(\emptyset \) matches no multiset at all. Interspersing weights in regular expressions allows them to describe weighted languages.

Definition 4

A multiset mc-regular expression is one where in every subexpression \(\alpha ^*\), \(\alpha \) is:

  • proper: \(\epsilon \notin \mathcal {L}(\alpha )\), and

  • monoalphabetic and connected: \(\mathcal {L}(\alpha )\) is unary.

As an example of why these restrictions are needed, consider the regular expression \((ab)^*\). Since the symbols commute, this is equivalent to \(\{a^nb^n\}\), which multiset automata would not be able to recognize. From now on, we assume that all multiset regular expressions are mc-regular and do not write “mc-.”

3 Matching Regular Expressions

In this section, we consider the problem of computing the weight that a multiset regular expression assigns to a multiset. The bad news is that this problem is NP-complete (Sect. 3.1). However, we can convert a multiset regular expression to a multiset automaton (Sect. 3.2) and run the automaton.

3.1 NP-Completeness

Theorem 1

The membership problem for multiset regular expressions is NP-complete.

Proof

Define a transformation \(\mathcal {T}\) from Boolean formulas in CNF over a set of variables X to multiset regular expressions over the alphabet \(X \cup \{\bar{x} \mid x \in X\}\):

$$\begin{aligned} \mathcal {T}(\phi _1 \vee \phi _2)&= \mathcal {T}(\phi _1) \cup \mathcal {T}(\phi _2) \\ \mathcal {T}(\phi _1 \wedge \phi _2)&= \mathcal {T}(\phi _1) \mathcal {T}(\phi _2) \\ \mathcal {T}(x)&= x \\ \mathcal {T}(\lnot x)&= \bar{x} \end{aligned}$$

Given a formula \(\phi \) in 3CNF, construct the multiset regular expression \(\alpha = \mathcal {T}(\phi )\). Let n be the number of clauses in \(\phi \). Then form the expression

$$\begin{aligned} \beta = \prod _x \left( x^n (\bar{x} \cup \epsilon )^n \cup (x \cup \epsilon )^n \bar{x}^n\right) \end{aligned}$$

Both \(\alpha \) and \(\beta \) have size polynomial in the size of \(\phi \). We claim that \(\phi \) is satisfiable if and only if \(L(\alpha \beta )\) contains \(w = \prod _x x^n \bar{x}^n\).

  • \((\Rightarrow )\) If \(\phi \) is satisfiable, form a string \(u = u_1 \cdots u_n\) as follows. For \(i = 1, \ldots , n\), the ith clause of \(\phi \) has at least one literal made true by the satisfying assignment. If that literal is x, then \(u_i = x\); if it is \(\lnot x\), then \(u_i = \bar{x}\). Clearly, \(u \in L(\alpha )\). Next, form a string \(v = \prod _x v_x\), where the \(v_x\) are defined as follows. For each x, if x is true under the assignment, then there are \(k \ge 0\) occurrences of x in u and zero occurrences of \(\bar{x}\) in u; let \(v_x = x^{n-k} \bar{x}^n\). Likewise, if x is false under the assignment, then there are \(k \ge 0\) occurrences of \(\bar{x}\) and zero occurrences of x, so let \(v_x = x^k \bar{x}^{n-k}\). Clearly, \(uv = w\) and \(v \in L(\beta )\).

  • \((\Leftarrow )\) If \(w \in L(\alpha \beta )\), then there exist strings \(uv=w\) such that \(u \in L(\alpha )\) and \(v \in L(\beta )\). For each x, it must be the case that v contains either \(x^n\) or \(\bar{x}^n\), so that u must either not contain x or not contain \(\bar{x}\). In the former case, let x be false; in the latter case, let x be true. The result is a satisfying assignment for \(\phi \).   \(\square \)

3.2 Conversion to Multiset Automata

Given a regular expression \(\alpha \), we can construct a finite multiset automaton corresponding to it. In addition to \(\lambda \), \(\mu (a)\), and \(\rho \), we compute Boolean matrices \(\kappa (a)\) with the same dimensions as \(\mu (a)\). The interpretation of these matrices is that \([\kappa (a)]_{qq} = 1\) iff the automaton can be in state q without having read an a.

If \(\alpha = a\), then for all \(b \ne a\):

$$\begin{aligned} \begin{array}{cccc} \lambda = \begin{bmatrix} 1 &{} 0 \end{bmatrix} &{} \qquad \mu (a)= \begin{bmatrix} 0 &{} 1 \\ 0 &{} 0 \end{bmatrix} &{} \qquad \kappa (a) = \begin{bmatrix}1 &{} 0 \\ 0 &{} 0 \end{bmatrix} &{} \qquad \rho = \begin{bmatrix}0 \\ 1 \end{bmatrix}\\[14pt] &{} \qquad \mu (b) = \begin{bmatrix} 0 &{} 0 \\ 0 &{} 0 \end{bmatrix} &{} \qquad \kappa (b) = \begin{bmatrix}1 &{} 0 \\ 0 &{} 1\end{bmatrix}. &{} \end{array} \end{aligned}$$

If \(\alpha = k \alpha _1\) (where \(k \in \mathbb {K}\)), then for all \(a \in \varSigma \):

$$\begin{aligned} \mu (a)&=\mu _1(a)&\lambda&= \lambda _1&\rho&= k \rho _1&\kappa (a)&= \kappa _1(a). \end{aligned}$$

If \(\alpha = \alpha _1 \cup \alpha _2\), then for all \(a \in \varSigma \):

$$\begin{aligned} \mu (a)&= \begin{bmatrix}\mu _1(a)&0 \\ 0&\mu _2(a)\end{bmatrix}&\lambda&= \begin{bmatrix}\lambda _1&\lambda _2 \end{bmatrix}&\rho&= \begin{bmatrix}\rho _1 \\ \rho _2 \end{bmatrix}&\kappa (a)&= \begin{bmatrix}\kappa _1(a)&0 \\ 0&\kappa _2(a) \end{bmatrix}. \end{aligned}$$

If \(\alpha = \alpha _1 \alpha _2\), then for all \(a \in \varSigma \):

$$\begin{aligned} \mu (a)&= \mu _1(a) \otimes \kappa _2(a) + I \otimes \mu _2(a)&\lambda&= \lambda _1 \otimes \lambda _2 \\ \kappa (a)&= \kappa _1(a) \otimes \kappa _2(a)&\rho&= \rho _1 \otimes \rho _2. \end{aligned}$$

If \(\alpha = \alpha _1^*\) and \(\alpha _1\) is unary, then for all \(a \in \varSigma \):

$$\begin{aligned} \mu (a)&= \mu _1(a) + \rho \lambda \mu _1(a)&\lambda&= \lambda _1&\rho&= \rho _1 + \lambda _1^{\top }&\kappa (a)&= \kappa _1(a). \end{aligned}$$

This construction can be explained intuitively as follows. The case \(\alpha =a\) is standard. The union operation is standard except that the use of two initial states makes for a simpler formulation. The shuffle product is similar to a conventional shuffle product except for the use of \(\kappa _2\). It builds an automaton whose states are pairs of states of the automata for \(\alpha _1\) and \(\alpha _2\). The first term in the definition of \(\mu (a)\) feeds a to the first automaton and the second term to the second; but it can be fed to the first only if the second has not already read an a, as ensured by \(\kappa _2(a)\). Finally, Kleene star adds a transition from final states to “second” states (states that are reachable from the initial state by a single a-transition), while also changing all initial states into final states.
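As a sanity check of the concatenation (shuffle) case, the following numpy sketch (variable names ours) builds the automaton for the regular expression ab from two base-case automata and verifies that the multiset \(\{a,b\}\) gets weight 1 in either reading order:

```python
import numpy as np

# Base-case automaton for alpha = a over Sigma = {a, b}, as given above.
I2 = np.eye(2)
lam1 = np.array([[1.0, 0.0]]); rho1 = np.array([[0.0], [1.0]])
mu1  = {'a': np.array([[0.0, 1.0], [0.0, 0.0]]), 'b': np.zeros((2, 2))}
kap1 = {'a': np.diag([1.0, 0.0]),                'b': I2}

# The automaton for alpha = b is symmetric.
lam2, rho2 = lam1, rho1
mu2  = {'b': mu1['a'], 'a': np.zeros((2, 2))}
kap2 = {'b': kap1['a'], 'a': I2}

# Concatenation case: mu(c) = mu1(c) (x) kap2(c) + I (x) mu2(c).
lam = np.kron(lam1, lam2)
rho = np.kron(rho1, rho2)
mu  = {c: np.kron(mu1[c], kap2[c]) + np.kron(I2, mu2[c]) for c in 'ab'}

def weight(word):
    m = np.eye(4)
    for c in word:
        m = m @ mu[c]
    return float(lam @ m @ rho)

# {a, b} has weight 1 regardless of reading order; {a, a} is rejected.
assert weight('ab') == weight('ba') == 1.0
assert weight('aa') == 0.0
```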

Let \(A(\alpha )\) denote the multiset automaton constructed from \(\alpha \). We can bound the number of states of \(A(\alpha )\) by \(2^{|\alpha |}\) by induction on the structure of \(\alpha \). For \(\alpha = \epsilon \), \(|A(\alpha )| = 1 \le 2^{|\alpha |}\). For \(\alpha = a\), \(|A(\alpha )| = 2 \le 2^{|\alpha |}\). For \(\alpha = \alpha _1 \cup \alpha _2\), \(|A(\alpha )| = |A(\alpha _1)| + |A(\alpha _2)| \le 2^{|\alpha |}\). For \(\alpha = \alpha _1\alpha _2\), \(|A(\alpha )| = |A(\alpha _1)| \, |A(\alpha _2)| \le 2^{|\alpha |}\). For \(\alpha = \alpha _1^*\), \(|A(\alpha )| = |A(\alpha _1)| \le 2^{|\alpha |}\).

3.3 Related Work

Droste and Gastin [4] show how to perform regular operations for the more general case of trace automata (automata on monoids). Our use of \(\kappa \) resembles their forward alphabet. Our construction does not utilize anything akin to their backward alphabet, so that we allow outgoing edges from final states and we allow initial states to be final states. Their construction, when converting \(\alpha _1^*\), creates \(m=|{{\mathrm{alph}}}(\alpha _1)|\) simultaneous copies of \(A(\alpha _1)\), that is, it creates an automaton with \(|A(\alpha _1)|^m\) states. Since our Kleene star is restricted to the unary case, we can use the standard, much simpler, Kleene star construction [2].

Our construction is a modification of a construction from previous work [3]. Previously, the shuffle operation required \({{\mathrm{alph}}}(\alpha _1)\) and \({{\mathrm{alph}}}(\alpha _2)\) to be disjoint; to ensure this required some rearranging of the regular expression before converting to an automaton. Our construction, while sharing the same upper bound on the number of states, operates directly on the regular expression without any preprocessing.

4 Learning Weights

Given a collection of multisets, the weights of the transition matrices and the initial and final weights can be learned automatically from data. Given a multiset w, we let \(\mu (w)= \prod _{i} \mu (w_i)\); this is well defined because the transition matrices commute. The probability of w, normalized over all possible multisets, is

$$\begin{aligned} P(w)&= \frac{1}{Z} \lambda \mu (w) \rho \\ Z&= \sum _{\text {multisets }{{w'}}} \lambda \mu (w') \rho . \end{aligned}$$

We must restrict \(w'\) to multisets up to a given length bound, which can be set to the size of the largest multiset that can reasonably occur in the setting at hand. Without this restriction, the infinite sum for Z diverges in many cases. For example, if \(\alpha = a^*\), then \(\mu (a)^n=\mu (a)\) and thus \(\lambda \mu (a) \rho = \lambda \mu (a)^n \rho \). Since this value is non-zero, the sum diverges.
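Under a length bound N, the truncated Z can be computed by enumerating each multiset once, which is well defined because the transition matrices commute. A minimal sketch with toy weights (names ours):

```python
import numpy as np
from itertools import combinations_with_replacement

# Truncated partition function: Z = sum over multisets w with |w| <= N
# of lam mu(w) rho. Because the transition matrices commute, mu(w) is
# well defined per multiset, so each multiset is enumerated once rather
# than once per ordering. Toy weights; names are ours.
lam = np.array([[1.0, 0.0]])
mu = {'a': np.diag([0.5, 0.2]), 'b': np.diag([0.1, 0.3])}
rho = np.array([[1.0], [1.0]])

def partition(N):
    total = 0.0
    for size in range(N + 1):
        for multiset in combinations_with_replacement(sorted(mu), size):
            m = np.eye(2)
            for sym in multiset:
                m = m @ mu[sym]
            total += float(lam @ m @ rho)
    return total

# Each term is nonnegative here, so the truncated Z grows with the bound.
assert partition(0) == 1.0
assert partition(2) >= partition(1) >= partition(0)
```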

The goal is to minimize the negative log-likelihood given by

$$\begin{aligned} L = - \sum _{w \in \text {data}} \log P(w). \end{aligned}$$

To this end, we describe two scenarios for how the multiset automata are formed.

4.1 Regular Expressions

In certain circumstances, we may start with a set of rules expressed as weighted regular expressions and wish to learn the weights from data. The conversion from weighted regular expressions to multiset automata is automatic (Sect. 3.2), and the resulting automata already have commuting transition matrices. The weights from the weighted regular expression are the parameters to be learned. They can be learned by stochastic gradient descent, with the gradient computed through automatic differentiation, and the transition matrices retain their commutativity by construction.

4.2 Finite Automata

We can learn the weighted automaton entirely from data by starting with a fully connected automaton on n nodes. All initial, transition, and final weights are initialized randomly. Learning proceeds by gradient descent on the log-likelihood with a penalty term that encourages the transition matrices to commute. Thus our modified log-likelihood is

$$\begin{aligned} L' = L + \alpha \sum _{a,b} \left\| \mu (a) \mu (b) - \mu (b)\mu (a) \right\| ^2 \end{aligned}$$

where \(\Vert \cdot \Vert \) is the Frobenius norm.

Over time we increase the penalty by increasing \(\alpha \). This method has the benefit of allowing us to learn the entire structure of the automaton directly from data without having to form rules as regular expressions. Additionally, since we set n at the start, the number of states can be kept small and computationally feasible. The main drawback of this method is that the transition matrices, while penalized for not commuting, may not exactly satisfy the commuting condition.
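One concrete way to make the penalty term a scalar is to take the squared Frobenius norm of each pairwise commutator; a minimal numpy sketch (function name and toy matrices ours):

```python
import numpy as np

# Scalar commutativity penalty: squared Frobenius norm of each pairwise
# commutator mu(a) mu(b) - mu(b) mu(a). During training this is added to
# the negative log-likelihood, scaled by a coefficient that is increased
# over time. Minimal sketch; function name and toy matrices are ours.
def commutativity_penalty(mu):
    syms = list(mu)
    total = 0.0
    for i, a in enumerate(syms):
        for b in syms[i + 1:]:
            comm = mu[a] @ mu[b] - mu[b] @ mu[a]
            total += float((comm ** 2).sum())
    return total

# Diagonal matrices commute, so their penalty is exactly zero.
assert commutativity_penalty({'a': np.diag([1.0, 2.0]),
                              'b': np.diag([3.0, 4.0])}) == 0.0
# Non-commuting matrices are penalized.
noisy = {'a': np.array([[0.0, 1.0], [0.0, 0.0]]),
         'b': np.array([[0.0, 0.0], [1.0, 0.0]])}
assert commutativity_penalty(noisy) > 0.0
```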

5 Computing Inside Weights

We can compute the total weight of a multiset incrementally by starting with \(\lambda \) and multiplying by \(\mu (a)\) for each a in the multiset. But in some situations, we might need to compose the weights of two partial runs. That is, having computed \(\mu (u)\) and \(\mu (v)\), we want to compute \(\mu (uv)\) as efficiently as possible; sometimes we also want to compute \(\mu (u)+\mu (v)\) efficiently.

For example, if we divide w into parts u and v to compute \(\mu (u)\) and \(\mu (v)\) in parallel [9], afterwards we need to compose them to form \(\mu (w)\). Or, we could intersect a context-free grammar with a multiset automaton, and parsing with the CKY algorithm would involve multiplying and adding these weight matrices. The recognition algorithm for extended DAG automata [3] uses multiset automata in this way as well.

Let M be a multiset automaton and \(\mu (a)\) its transition matrices. Let us call \(\mu (w)\) the matrix of inside weights of w. If stored in the obvious way, it takes \(\mathcal {O}(d^2)\) space. If \(w=uv\) and we know \(\mu (u)\) and \(\mu (v)\), we can compute \(\mu (w)\) by matrix multiplication in \(\mathcal {O}(d^3)\) time. Can we do better?

The set of all matrices \(\mu (w)\) spans a module which we call \({{\mathrm{Ins}}}(M)\). We show in this section that, under the right conditions, if M has d states, then \({{\mathrm{Ins}}}(M)\) has a generating set of size d, so that we can represent \(\mu (w)\) as a vector of d coefficients. We begin with the special case of unary languages (Sect. 5.1), then after a brief digression to more general languages (Sect. 5.2), we consider multiset regular expressions converted to multiset automata (Sect. 5.3).

5.1 Unary Languages

Suppose that the automaton is unary, that is, over the alphabet \(\varSigma = \{a\}\). Throughout this section, we write \(\mu \) for \(\mu (a)\) for brevity.

Ring-Weighted. The inside weights of a string \(w = a^n\) are simply the matrix \(\mu ^n\), and the inside weights of a set of strings form a polynomial in \(\mu \). We can take this polynomial as our representation of inside weights, provided we can limit its degree.

The Cayley-Hamilton theorem (CHT) says that any matrix \(\mu \) over a commutative ring satisfies its own characteristic equation, \(\det (\lambda I-\mu ) = 0,\) by substituting \(\mu \) for \(\lambda \). The left-hand side of this equation is the characteristic polynomial; its highest-degree term is \(\lambda ^d\). So if we substitute \(\mu \) into the characteristic equation and solve for \(\mu ^d\), we have a way of rewriting any polynomial in \(\mu \) of degree d or more into a polynomial of degree less than d.

So representing the inside weights as a polynomial in \(\mu \) takes only O(d) space, and addition takes O(d) time. Naive multiplication of polynomials takes \(O(d^2)\) time; fast Fourier transform can be used to speed this up to \(O(d \log d)\) time, although d would have to be quite large to make this practical.
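Over the reals, this degree reduction can be checked directly with numpy: `np.poly` returns the coefficients of the characteristic polynomial, and Cayley-Hamilton rewrites \(\mu ^d\) as a polynomial of degree less than d (the example matrix is ours):

```python
import numpy as np

# Cayley-Hamilton over the reals: mu satisfies its own characteristic
# polynomial, so mu^d can be rewritten as a polynomial of degree < d.
# np.poly returns monic coefficients [1, c_1, ..., c_d] of det(lam I - mu).
# The example matrix is ours (upper triangular, so eigenvalues are exact).
mu = np.array([[0.5, 1.0, 0.0],
               [0.0, 0.2, 1.0],
               [0.0, 0.0, 0.1]])
d = mu.shape[0]
c = np.poly(mu)

# CHT: mu^d = -(c_1 mu^{d-1} + ... + c_d I), a degree-(d-1) representation.
powers = [np.linalg.matrix_power(mu, k) for k in range(d)]
reduced = -sum(c[d - k] * powers[k] for k in range(d))

assert np.allclose(np.linalg.matrix_power(mu, d), reduced)
```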

Semiring-Weighted. Some very commonly used weights do not form rings: for example, the Boolean semiring, used for unweighted automata, and the Viterbi semiring, used to find the highest-weight path for a string.

There is a version of CHT for semirings due to Rutherford [10]. In a ring, the characteristic equation can be expressed using the sums of determinants of principal minors of order r. Denote the sum of positive terms (even permutations) as \(p_r\) and sum of negative terms (odd permutations) as \(-q_r\). Then Rutherford expresses the characteristic equation applicable for both rings and semirings as

$$\begin{aligned} \lambda ^d+q_1 \lambda ^{d-1}+p_2 \lambda ^{d-2} + q_3 \lambda ^{d-3} + \cdots&= p_1 \lambda ^{d-1}+q_2 \lambda ^{d-2} + p_3 \lambda ^{d-3} + \cdots \end{aligned}$$

For any \(K \subseteq \mathbb {N}\), let \(S_K\) be the set of all permutations of K, and let \({{\mathrm{sgn}}}(\sigma )\) be \(+1\) for an even permutation and \(-1\) for an odd permutation. The characteristic equation is then

$$\begin{aligned} \sum _{K \subseteq [d]} \sum _{\pi \in S_K \atop {{\mathrm{sgn}}}(\pi ) \ne (-1)^{|K|}} \left( \prod _{i\in K} \mu _{i,\pi (i)}\right) \lambda ^{d-|K|}&= \sum _{K \subseteq [d]} \sum _{\pi \in S_K \atop {{\mathrm{sgn}}}(\pi ) = (-1)^{|K|}} \left( \prod _{i \in K} \mu _{i,\pi (i)}\right) \lambda ^{d-|K|}. \end{aligned}$$
(1)

If we can ensure that the characteristic equation has just \(\lambda ^d\) on the left-hand side, then we have a compact representation for inside weights. The following result characterizes the graphs for which this is true.
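Over the reals, where subtraction is available, collecting all terms of (1) on one side recovers (up to an overall sign) the ordinary characteristic polynomial; the following brute-force numpy check sums \({{\mathrm{sgn}}}(\pi ) \prod _i \mu _{i,\pi (i)}\) over all principal subsets (helper names ours):

```python
import numpy as np
from itertools import combinations, permutations

# Brute-force characteristic polynomial via principal minors: coefficient
# of lambda^{d-k} is (-1)^k times the sum over size-k subsets K and
# permutations pi of K of sgn(pi) * prod mu[i, pi(i)]. Helper names ours.

def perm_sign(p):
    """Sign of a permutation given as a sequence of distinct values."""
    s = 1
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            if p[i] > p[j]:
                s = -s
    return s

def char_poly_coeffs(mu):
    d = mu.shape[0]
    coeffs = [1.0]                      # coefficient of lambda^d
    for k in range(1, d + 1):
        e_k = 0.0                       # sum of principal k x k minors
        for K in combinations(range(d), k):
            for pi in permutations(K):
                prod = 1.0
                for i, j in zip(K, pi):
                    prod *= mu[i, j]
                e_k += perm_sign(pi) * prod
        coeffs.append((-1) ** k * e_k)
    return np.array(coeffs)

# Symmetric example matrix (ours), so the eigenvalues are real.
mu = np.array([[0.5, 1.0, 0.0],
               [1.0, 0.2, 1.0],
               [0.0, 1.0, 0.1]])
assert np.allclose(char_poly_coeffs(mu), np.poly(mu))
```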

Theorem 2

Given a semiring-weighted directed graph G, the characteristic equation of G’s adjacency matrix, given by the semiring version of CHT, has only \(\lambda ^d\) on its left-hand side if and only if G does not have two node-disjoint cycles.

Proof

Let K be a node-induced subgraph of the directed graph G. A linear subgraph of K is a subgraph containing all nodes of K in which each node has indegree and outdegree 1; that is, a collection of directed cycles such that each node in K occurs in exactly one cycle. Every permutation \(\pi \) of K corresponds to the linear subgraph of K containing edges \((i, \pi (i))\) for each \(i \in K\) [6].

Note that \({{\mathrm{sgn}}}(\pi ) = +1\) iff the corresponding linear subgraph has an even number of even-length cycles. Moreover, note that \({{\mathrm{sgn}}}(\pi )=(-1)^{|K|}\) appearing in (1) holds iff the corresponding linear subgraph has an even number of cycles (of any length). So if the transition graph does not have two node-disjoint cycles, the only nonzero term in (1) with \({{\mathrm{sgn}}}(\pi )=(-1)^{|K|}\) is that for which \(K = \emptyset \), that is, \(\lambda ^d\). To prove the other direction, suppose that the graph does have two node-disjoint cycles; then the linear subgraph containing just these two cycles corresponds to a \(\pi \) that makes \({{\mathrm{sgn}}}(\pi )=(-1)^{|K|}\).   \(\square \)

The coefficients in (1) look difficult to compute; however, the product inside the parentheses is zero unless the permutation \(\pi \) corresponds to a cycle in the transition graph of the automaton. Since we are computing this product on linear subgraphs, we are only concerned with simple cycles. Using an algorithm by Johnson [8], all simple cycles in a directed graph can be found in \(\mathcal {O}((n+e)(c+1))\) time, where n is the number of nodes, e the number of edges, and c the number of simple cycles.

Theorem 3

A digraph G with no two disjoint dicycles has at most \(2^{|V|-1}\) simple dicycles.

Proof

First, a theorem from Thomassen [11] limits the number of cases we must consider. In the first case, one vertex, \(v_s\), is contained in every cycle. If we consider \(G \setminus \{v_s\}\), this is a directed acyclic graph (DAG) and thus there is a partial order determined by reachability. This partial order determines the order that vertices appear in any cycle in G, which limits the number of simple cycles to the number of choices for picking vertices to join \(v_s\) in each cycle. This is a binary choice on \(|V|-1\) vertices, thus \(2^{|V|-1}\) possible cycles (see Fig. 1).

In the second case, the graph contains a subgraph on 3 vertices with no self-loops but all 6 possible edges between them. If we let S be the set of these three vertices, then \(G \setminus S\) has a partial order on it just as in the first case. Additionally, for each \(s \in S\), there exists a partial order on \(G \setminus (S\setminus \{s\})\), and these uniquely determine the order of vertices in any cycle in G. While the bound could be lowered, this is bounded above by \(2^{|V|-1}\).

All other cases can be combined with the second case by observing that they all start with the same graph as the second case, then modified by subdivision (breaking an edge in two by inserting a vertex in the middle) or splitting (breaking a vertex in two, one with all in edges, one with all out edges, then adding one edge from the in vertex to the out vertex). These cases do not violate the arguments of the second case, nor add any additional cycles. Intuitively, these are graphs from case two with some edge(s) deleted.    \(\square \)

Fig. 1. A directed graph achieving the \(2^{|V|-1}\) simple cycle bound.
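The bound of Theorem 3 can be checked by brute force on a family of graphs of the first kind in the proof: a hub vertex that lies on every cycle, plus a DAG with all forward edges among the remaining vertices. Each nonempty subset of the non-hub vertices yields exactly one simple cycle, for \(2^{|V|-1}-1\) cycles, just under the stated bound (sketch and names ours):

```python
# Brute-force check of the cycle bound on graphs of the first kind in the
# proof: a hub vertex 0 lies on every cycle, and the remaining vertices
# 1..k form a DAG with all "forward" edges i -> j for i < j. Every
# nonempty subset of {1..k} then yields exactly one simple cycle through
# the hub, giving 2^k - 1 <= 2^{|V|-1} simple cycles. Names are ours.

def count_simple_cycles_through_hub(k):
    adj = {0: set(range(1, k + 1))}
    for i in range(1, k + 1):
        adj[i] = {0} | set(range(i + 1, k + 1))

    count = 0

    def dfs(v, visited):
        nonlocal count
        for w in adj[v]:
            if w == 0:
                count += 1              # closed a simple cycle at the hub
            elif w not in visited:
                dfs(w, visited | {w})

    dfs(0, {0})
    return count

# Removing the hub leaves a DAG, so these are all the simple cycles.
for k in range(1, 7):
    assert count_simple_cycles_through_hub(k) == 2 ** k - 1
```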

5.2 Digression: Binary Languages and Beyond

If \(\varSigma \) has two symbols and the transition matrices are commuting matrices over a field, then inside weights can still be represented in d dimensions [5]. We give only a brief sketch here of the simpler, algebraically closed case [1].

Given a matrix M with entries in an algebraically closed field, there exists a matrix S such that \(S^{-1}MS\) is in Jordan form. A matrix in Jordan form has the following block structure, where each \(A_i\) is a square matrix and \(\lambda _i\) is an eigenvalue:

$$\begin{aligned} S^{-1}MS= \begin{bmatrix} A_1 &{} &{} 0\\ &{} \ddots &{} \\ 0 &{} &{} A_p \end{bmatrix} \qquad A_i = \begin{bmatrix} \lambda _i &{} 1 &{} &{} 0\\ &{} \lambda _i &{} \ddots &{} \\ &{} &{} \ddots &{} 1 \\ 0 &{} &{} &{} \lambda _i \end{bmatrix} \end{aligned}$$

Let the number of rows in \(A_i\) be \(k_i\), and let \(M = \mu (a)\) be one of the commuting transition matrices. Then the following matrices span the algebra generated by the commuting transition matrices \(\mu (a)\) and \(\mu (b)\):

$$\begin{aligned} &1, \mu (a), \ldots , \mu (a)^{k_1-1},\\ &\mu (b), \mu (a)\mu (b), \ldots , \mu (a)^{k_2-1}\mu (b),\\ &\qquad \vdots \\ &\mu (b)^{p-1}, \mu (a)\mu (b)^{p-1}, \ldots , \mu (a)^{k_p-1}\mu (b)^{p-1}. \end{aligned}$$

The number of matrices in this list is equal to the dimension of \(\mu (a)\) and \(\mu (b)\), which in our case is d. Further, a basis for the algebra is contained within this list. Therefore the inside weights can be represented in d dimensions.

On the other hand, if the weights come from a ring, the above fact does not hold in general [7]. Going beyond binary languages, if \(\varSigma \) has four or more symbols, then inside weights might need as many as \(\lfloor d^2/4 \rfloor +1\) dimensions, which is not much of an improvement [5]. The case of three symbols remains open [7] (Fig. 2).

Fig. 2. Example commutative automaton whose inside weights require storing more than d values.

5.3 Regular Expressions

Based on the above results, we might not be optimistic about efficiently representing inside weights for languages other than unary languages. But in this subsection, we show that for multiset automata converted from multiset regular expressions, we can still represent inside weights using only d coefficients. We show this inductively on the structure of the regular expression.

First, we need some properties of the matrices \(\kappa (a)\).

Lemma 1

If \(\mu (a)\) and \(\kappa (a)\) are constructed from a multiset regular expression, then

  1. \(\kappa (a)\kappa (a) = \kappa (a)\).

  2. \(\kappa (a)\kappa (b) = \kappa (b)\kappa (a)\).

  3. \(\mu (a) \kappa (a) = 0\).

  4. \(\mu (a) \kappa (b) = \kappa (b) \mu (a)\) if \(a \ne b\).

To show that \({{\mathrm{Ins}}}(M)\) can be expressed in d dimensions, we will need to prove an additional property about the structure of \({{\mathrm{Ins}}}(M)\). Note that if \({{\mathrm{Ins}}}(M)\) is not a free module, we take \(\dim {{\mathrm{Ins}}}(M)\) to mean the size of the generating set we construct.

Theorem 4

If M is a ring-weighted multiset automaton with d states converted from a regular expression, then

  1. \(\dim {{\mathrm{Ins}}}(M) = d\).

  2. \({{\mathrm{Ins}}}(M)\) can be decomposed into a direct sum

    $$\begin{aligned} {{\mathrm{Ins}}}(M) \cong \bigoplus _{\varDelta \subseteq \varSigma } {{\mathrm{Ins}}}_\varDelta (M) \end{aligned}$$

    where \(\mu (w) \in {{\mathrm{Ins}}}_\varDelta (M)\) iff \({{\mathrm{alph}}}(w) = \varDelta \).

Proof

By induction on the structure of the regular expression \(\alpha \).

If \(\alpha \) is unary: the Cayley-Hamilton theorem gives a generating set \(\{I, \mu (a), \ldots , \mu (a)^{d-1}\}\), which has size d. Moreover, let \({{\mathrm{Ins}}}_\emptyset (M)\) be the span of \(\{I\}\) and \({{\mathrm{Ins}}}_{\{a\}}(M)\) be the span of the \(\mu (a)^i\) (\(i>0\)). The automaton M, by construction, has a state (the initial state) with no incoming transitions. That is, its transition matrix has a zero column, which means that its characteristic polynomial has no I term. Therefore, if \(w \ne \epsilon \), \(\mu (w) \in {{\mathrm{Ins}}}_{\{a\}}(M)\).

If \(\alpha = k\alpha _1\), then \({{\mathrm{Ins}}}(M) = {{\mathrm{Ins}}}(M_1)\), so both properties hold of \({{\mathrm{Ins}}}(M)\) if they hold of \({{\mathrm{Ins}}}(M_1)\).

If \(\alpha = \alpha _1 \cup \alpha _2\), the inside weights of \(M_{1} \cup M_{2}\) for w are

$$\begin{aligned} \mu (w) = \prod _{a \in w} \mu (a) = \prod _{a} \begin{bmatrix} \mu _1(a)&0 \\ 0&\mu _2(a) \end{bmatrix} = \begin{bmatrix} \prod _a \mu _1(a)&0 \\ 0&\prod _a \mu _2(a) \end{bmatrix} = \begin{bmatrix} \mu _1(w)&0 \\ 0&\mu _2(w) \end{bmatrix}. \end{aligned}$$

Thus, \({{\mathrm{Ins}}}(M) \cong {{\mathrm{Ins}}}(M_1) \oplus {{\mathrm{Ins}}}(M_2)\), and \(\dim {{\mathrm{Ins}}}(M) = \dim {{\mathrm{Ins}}}(M_1) + \dim {{\mathrm{Ins}}}(M_2)\). Moreover, \({{\mathrm{Ins}}}_\varDelta (M) \cong {{\mathrm{Ins}}}_\varDelta (M_1) \oplus {{\mathrm{Ins}}}_\varDelta (M_2)\).

If \(\alpha = \alpha _1 \alpha _2\), the inside weights for w are

$$\begin{aligned} \mu (w)&= \prod _{a \in w} \mu (a) = \prod _{a \in w} (\mu _1(a) \otimes \kappa _2(a) + I \otimes \mu _2(a)) \\&= \sum _{uv=w} \left( \prod _{a \in u} \mu _1(a) \otimes \prod _{a \in u} \kappa _2(a) \prod _{a \in v} \mu _2(a) \right) \\&= \sum _{uv=w} \mu _1(u) \otimes \kappa _2(u) \mu _2(v) \end{aligned}$$

where we have used Lemma 1 and properties of the Kronecker product. Let \(\{e_i\}\) and \(\{f_i\}\) be a generating set for \({{\mathrm{Ins}}}(M_1)\) and \({{\mathrm{Ins}}}(M_2)\), respectively. Then the above can be written as a linear combination of terms of the form \(e_i \otimes \kappa _2(u) f_j\). We take these as a generating set for \({{\mathrm{Ins}}}(M)\). Although it may seem that there are too many generators, note that if both \(\mu _1(u)\) and \(\mu _1(u')\) depend on \(e_i\), they belong to the same submodule and therefore use the same symbols, so \(\kappa _2(u) = \kappa _2(u')\) (Lemma 1.1). Therefore, the \(e_i \otimes \kappa _2(u) f_j\) form a generating set of size \(\dim {{\mathrm{Ins}}}(M_1) \cdot \dim {{\mathrm{Ins}}}(M_2)\).

Moreover, let \({{\mathrm{Ins}}}_\varDelta (M)\) be the submodule spanned by all the \(\mu _1(u) \otimes \kappa _2(u) \mu _2(v)\) such that \({{\mathrm{alph}}}(uv) = \varDelta \).    \(\square \)

6 Conclusion

We have examined weighted multiset automata, showing how to construct them from weighted regular expressions, how to learn weights automatically from data, and how, in certain cases, inside weights can be computed more efficiently in terms of both time and space complexity. We leave implementation and application of these methods for future work.