1 Introduction

Over the past twenty years, applications of software verification have scaled from small academic programs to finding errors in the GNU Coreutils [7]. In principle, the employed verification strategies explore the control-flow graph of the program, gather constraints over program variables and pass these constraints to a constraint solver. The primary workhorse of software verification is thus the constraint solver, and the scalability of software verification is largely achieved by improving the efficiency of constraint solvers. The theories supported by constraint solvers are likewise highly influenced by the needs of software verification tools (e.g. array theory and bitvector arithmetic). A recent need of software verification tools is the ability to cope with string constraints, i.e. equations expressing equalities between concatenations of string constants and string variables. This need arose from the desire to verify software written in languages where string manipulation is a core part of the language (e.g. JavaScript and Java) [9, 19]. To accomplish this goal, we have seen the advent of dedicated string solvers as well as constraint solvers implementing string solving techniques. As an incomplete list we mention HAMPI [15], CVC4 [4], Ostrich [8], Sloth [11], Norn [1], S3P [20] and Z3str3 [5].

Although the need for string solving only recently surfaced in the software verification community, the problem is in fact older and known as Word Equations (a term that we will use from now on). The word equation satisfiability problem is to determine whether the two sides of an equation can be unified, i.e., transformed into two equal strings containing constant letters only, by substituting the variables consistently by strings of constants. For example, consider the equation defined by the two strings XabY and aXYb, denoted \(X a b Y \,\dot{=}\,a XY b\), with variables X, Y and constants a and b. It is satisfiable because X can be substituted by a and Y by b, which produces the equality \(aabb = aabb\). In fact, substituting X by an arbitrary number of a’s and Y by an arbitrary number of b’s unifies the two sides of the equation.

The word equation problem is decidable [16] and NP-hard. In a series of works, Jeż [12, 13] showed that word equations can be solved in non-deterministic linear space. It has been shown by Plandowski [18] that there exists an upper bound of \(2^{2^{O(n^4)}}\) for the smallest solution to a word equation of length n. Having this in mind, a standard method for solving word equations is known as filling the positions [14, 17]. In this method a length for each of the string variables is non-deterministically selected. Having a fixed length of the variables reduces the problem to lining up the positions of the two sides of the equation, and filling the unknown positions of the variables with characters, making the two sides equal.
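For illustration, the following is a minimal brute-force sketch of the filling-the-positions idea (purely didactic, and not the encoding developed later in this paper): it exhaustively guesses a length for every variable, fills the variable positions with letters, and compares both sides.

```python
from itertools import product

def solve_by_filling(u, v, variables, alphabet, max_len):
    """Naive 'filling the positions' search: non-deterministically (here:
    exhaustively) pick a length for every variable, then try all ways of
    filling its positions with letters and compare both sides."""
    var_list = sorted(variables)
    for lengths in product(range(max_len + 1), repeat=len(var_list)):
        fillings = [product(alphabet, repeat=n) for n in lengths]
        for filling in product(*fillings):
            subst = {x: "".join(s) for x, s in zip(var_list, filling)}
            lhs = "".join(subst.get(c, c) for c in u)
            rhs = "".join(subst.get(c, c) for c in v)
            if lhs == rhs:
                return subst
    return None

# The equation XabY = aXYb from above; the first solution found happens to
# be the empty substitution X = Y = ε, which also unifies both sides.
print(solve_by_filling("XabY", "aXYb", {"X", "Y"}, "ab", 2))
```

The exhaustive enumeration is of course exponential; the point of the remainder of the paper is to delegate exactly this search to a SAT solver.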

In this paper we present a new solver for word equations with linear length constraints, Woorpje. In particular, it guesses a maximal length for each variable and encodes a variation of the filling-the-positions method into an automaton construction, thereby reducing the search for a solution to a reachability question on this automaton. Preliminary experiments with a pure automata-reachability-based approach revealed, however, that this does not scale even for small word equations. Woorpje therefore encodes the automaton into SAT and uses the tool Glucose [3] as a backend. Unlike other approaches based on the filling-the-positions method (e.g. [6, 19]), Woorpje does not need an exact bound for each variable, but only an upper bound. Experiments indicate that Woorpje is not only reliable but also competitive with the more mature CVC4 and Z3. Results indicate that Woorpje is quicker on pure word equations (no linear length constraints), and that CVC4 and Z3 mainly have an edge on word equations with linear constraints. This may be due to our naive approach to solving linear length constraints.

2 Preliminaries

Let \({\mathrm{I\!N}}\) be the set of natural numbers, let [n] be the set \(\{0,1,\dots ,n-1\}\) and \([n]_0\) the set \(\{0,1,\dots ,n\}\). For a finite set \(\varDelta \) of symbols, we let \(\varDelta ^*\) be the set of all words over \(\varDelta \) and \(\varepsilon \) be the empty word. For an alphabet \(\varDelta \) and \(a \notin \varDelta \), we let \(\varDelta _a\) denote the set \(\varDelta \cup \{a\}\). For a word \(w = x_0x_1\dots x_{n-1}\) we let \(|w| = n\) refer to its length. For \(i \in \left[ |w|\right] \) we denote by w[i] the symbol on the ith position of w, i.e. \(w[i] = x_i\). For \(a\in \varDelta \) and \(w\in \varDelta ^*\) we let \(|w|_a\) denote the number of occurrences of a in w. If \(w=v_1v_2\) for some words \(v_1,v_2\in \varDelta ^*\), then \(v_1\) is called a prefix of w and \(v_2\) a suffix of w. In the remainder of the paper, we let \(\varXi = \varSigma \cup \varGamma \) where \(\varSigma \) (\(\varGamma \)) is a set of symbols called letters (variables) and \(\varSigma \cap \varGamma =\emptyset \). We call a word \(w\in \varXi ^*\) a pattern over \(\varXi \). For a pattern \(w\in \varXi ^*\) we let \(\mathrm{var}(w)\subseteq \varGamma \) denote the set of variables from \(\varGamma \) occurring in w. A substitution for \(\varXi \) is a morphism \(S : \varXi ^* \rightarrow \varSigma ^*\) with \(S(a) = a\) for every \(a \in \varSigma \) and \(S(\varepsilon ) = \varepsilon \). Note that to define a substitution S, it suffices to define S(X) for all \(X\in \varGamma \).

A word equation over \(\varXi \) is a tuple \((u,v) \in \varXi ^* \times \varXi ^*\) written \(u \,\dot{=}\,v\). A substitution S over \(\varXi \) is a solution to a word equation \(u\,\dot{=}\,v\) (denoted \(S {{\,\mathrm{\models }\,}}u\,\dot{=}\,v\)) if \(S(u) = S(v)\). A word equation \(u \,\dot{=}\,v\) is satisfiable if there exists a substitution S such that \(S {{\,\mathrm{\models }\,}}u\,\dot{=}\,v\). A system of word equations is a set of word equations \(P \subseteq \varXi ^* \times \varXi ^*\). A system of word equations P is satisfiable if there exists a substitution S that is a solution to all word equations in P (denoted \(S \,\models \,P\)). Karhumäki et al. [14] showed that for every system of word equations, a single equation can be constructed which is satisfiable if and only if the initial system is satisfiable. A solution to the constructed word equation can be directly transferred to a solution of the original word equation system.

Bounded Word Equations. A natural sub-problem of solving word equations is that of Bounded Word Equations. In this problem we are given not only a word equation \(u \,\dot{=}\,v\) but also a set of length constraints \(\{|X| \le b_X \mid X \in \varGamma \wedge b_X \in {\mathrm{I\!N}}\}\). The bounded word equation is satisfiable if there exists a substitution S such that \(S{{\,\mathrm{\models }\,}}u\,\dot{=}\,v\) and \(|S(X)| \le b_X\) for each \(X\in \varGamma \). For convenience, we shall sometimes refer to the set of bounds \(b_X\) as a function \(B: \varGamma \rightarrow {\mathrm{I\!N}}\) such that \(b_X = B(X)\).

Word Equations with Linear Constraints. A word equation with linear constraints is a word equation \(u \,\dot{=}\,v\) accompanied by a system \(\theta \) of linear Diophantine equations, where the unknowns correspond to the lengths of possible substitutions of the variables in \(\varGamma \). A word equation with linear constraints is satisfiable if there exists a substitution S such that \(S{{\,\mathrm{\models }\,}}u\,\dot{=}\,v\) and S satisfies \(\theta \). Note that the bounded word equation problem is in fact a special case of word equations with linear constraints.

SAT Solving. A Boolean formula \(\varphi \) with finitely many Boolean variables is usually given in conjunctive normal form, i.e. as a conjunction over a set of disjunctions (called clauses), \(\varphi = \bigwedge _i \bigvee _j l_{i,j}\), where each literal \(l_{i,j}\) is a Boolean variable x or its negation \(\lnot x\). A mapping \(\beta \) from the Boolean variables to \(\{0,1\}\) is called an assignment; for such an assignment, a literal l evaluates to true if and only if \(l = x_i\) and \(\beta (x_i) = 1\), or \(l = \lnot x_i\) and \(\beta (x_i) = 0\). A clause inside a formula in conjunctive normal form evaluates to true if at least one of its literals evaluates to true. We call a formula \(\varphi \) satisfied (under an assignment) if all clauses evaluate to true. If no satisfying assignment exists, \(\varphi \) is unsatisfiable.
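As a small illustration of these notions, the sketch below represents a CNF formula as a list of clauses and evaluates it under an assignment; the data representation is chosen for illustration only.

```python
# CNF formula (x1 OR NOT x2) AND (x2 OR x3): a clause is a list of literals,
# a literal is a pair (variable, polarity).
cnf = [[("x1", True), ("x2", False)], [("x2", True), ("x3", True)]]

def satisfied(cnf, beta):
    """The formula is satisfied iff every clause contains at least one
    literal that evaluates to true under the assignment beta."""
    return all(any(beta[v] == pol for v, pol in clause) for clause in cnf)

print(satisfied(cnf, {"x1": False, "x2": False, "x3": True}))  # True
```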

3 Word Equation Solving

In this section we focus on solving Bounded Word Equations and Word Equations with Linear Constraints. We first solve bounded word equations, and then discuss a minor change that allows solving word equations with linear constraints.

3.1 Solving Bounded Word Equations

Recall that a bounded word equation consists of a word equation \(u \,\dot{=}\,v\) along with a set of constraints \(\{|X| \le b_X \}\) providing upper bounds for the solution of each variable X. In our approach we use these bounds to create a finite automaton which has an accepting run if and only if the bounded word equation is satisfiable.

Before the actual automata construction, we need some convenient transformations of the word equation itself. For a variable X with length bound \(b_X\), we replace X with a sequence of new ‘filled variables’ \(X^{(0)} \cdots X^{(b_X-1)}\), each of which may only be substituted by a single letter or the empty word. A pattern containing only filled variables and letters is called a filled pattern. For a pattern \(w \in \varXi ^*\) we denote its corresponding filled pattern by \(w_\xi \). In the following, we refer to the alphabet of filled variables by \(\varGamma _\xi \) and by \(\varXi _\xi = \varSigma \cup \varGamma _\xi \) to the alphabet of the filled patterns. Let \(S : \varXi ^* \rightarrow \varSigma ^*\) be a substitution for \(w \in \varXi ^*\). We can canonically define the induced substitution for filled patterns as \(S_\xi : \left( \varSigma \cup \varGamma _\xi \right) \rightarrow \varSigma _\lambda \) with \(S_\xi (a) = S(a)\) for all \(a \in \varSigma \), \(S_\xi (X^{(i)}) = S(X)[i]\) for all \(X^{(i)} \in \varGamma _\xi \) and \(i < |S(X)|\), and \(S_\xi (X^{(j)}) = \lambda \) for all \(X^{(j)} \in \varGamma _\xi \) and \(|S(X)| \le j < b_X\). Here, \(\lambda \) is a new symbol (\(\lambda \notin \varXi _\xi \)) indicating an unused position at the end of a filled variable. Note that the substitution of a single filled variable always maps to exactly one character from \(\varSigma _\lambda \), and, as soon as we discover \(S_\xi \left( X^{(j)}\right) = \lambda \) for \(j \in \left[ b_X\right] \), it also holds that \(S_\xi \left( X^{(i)}\right) = \lambda \) for all \(j\le i < b_X\). In a sense, the new element \(\lambda \) behaves in the same way as the neutral element of the word monoid \(\varSigma ^*\), being a placeholder for the empty word \(\varepsilon \). In the other direction, if we have found a substitution that unifies the two filled patterns obtained from the left-hand side and the right-hand side of an equation, we can transform it into a substitution for the original word equation by defining S(X), for all \(X \in \varGamma \), as the concatenation \(S_\xi \left( X^{(0)}\right) \dots S_\xi \left( X^{(b_X-1)}\right) \) in which each occurrence of \(\lambda \) is replaced by the empty word \(\varepsilon \).
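A minimal sketch of this transformation, assuming variables are single characters and filled variables are represented as (name, index) pairs (an illustrative representation, not Woorpje's data structures):

```python
def fill_pattern(w, bounds):
    """Replace every variable X (a key of `bounds`) in the pattern w by its
    filled variables X^(0) ... X^(b_X - 1); constant letters are kept."""
    filled = []
    for c in w:
        if c in bounds:
            filled.extend((c, i) for i in range(bounds[c]))   # filled variable X^(i)
        else:
            filled.append(c)                                  # constant letter
    return filled

# Example 1 below: u = aZXb, v = aXaY with b_X = b_Y = b_Z = 1.
bounds = {"X": 1, "Y": 1, "Z": 1}
print(fill_pattern("aZXb", bounds))   # ['a', ('Z', 0), ('X', 0), 'b']
print(fill_pattern("aXaY", bounds))   # ['a', ('X', 0), 'a', ('Y', 0)]
```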

Our goal is now to build an automaton which computes a suitable substitution for a given equation. During the computation there are situations where a substitution does not form a total function. To extend a partial substitution we define for \(X \in \varXi \) and \(b \in \varSigma ^*\) the notation \(S\left[ \frac{X}{b}\right] \), which denotes the substitution that additionally maps X to b whenever S(X) is undefined, and \(S\left[ \frac{X}{b}\right] = S\) otherwise. This definition can be naturally applied to filled substitutions. We define a relation \(\overset{S_\xi }{\sim }\) which sets variables and letters in relation whenever their images with respect to a partial substitution \(S_\xi \) are equal or undefined. For all \(a,b \in \varXi _\xi \cup \{\lambda \}\) we define

$$ a \overset{S_\xi }{\sim }b \text { iff } S_\xi \left( a\right) = S_\xi \left( b\right) \text { or } S_\xi \left( b\right) \notin \varSigma ^*_\lambda \text { or } S_\xi \left( a\right) \notin \varSigma ^*_\lambda . $$

Definition 1

For a word equation \(u\,\dot{=}\,v\) with \(u,v \in \varXi ^*\) and a mapping \(B : \varGamma \rightarrow {\mathrm{I\!N}}\) defining the bounds \(B(X)=b_X\), we define the equation automaton \(A (u\,\dot{=}\,v,B) = (Q,\delta ,I,F)\), where Q is a set of states, each consisting of two integers, indicating the current positions inside the two words \(u_\xi \) and \(v_\xi \), and a partial substitution. The transition function \(\delta : Q \times \varSigma _\lambda \rightarrow Q\) allows, when reading a symbol \(c \in \varSigma _\lambda \), three kinds of moves from a state \(\left( (i,j),S_\xi \right) \): a move to \(\left( (i+1,j+1),S_\xi \left[ \frac{u_\xi [i]}{c}\right] \left[ \frac{v_\xi [j]}{c}\right] \right) \) whenever \(u_\xi [i] \overset{S_\xi }{\sim } c\) and \(v_\xi [j] \overset{S_\xi }{\sim } c\), a move to \(\left( (i+1,j),S_\xi \left[ \frac{u_\xi [i]}{\lambda }\right] \right) \) whenever \(c = \lambda \) and \(u_\xi [i] \overset{S_\xi }{\sim } \lambda \), and a move to \(\left( (i,j+1),S_\xi \left[ \frac{v_\xi [j]}{\lambda }\right] \right) \) whenever \(c = \lambda \) and \(v_\xi [j] \overset{S_\xi }{\sim } \lambda \).

The initial state I is \(\left( (0,0),S_\emptyset \right) \), where \(S_\emptyset \) is the everywhere-undefined partial substitution, and the set of final states F consists of all states at position \(\left( |u_\xi |,|v_\xi |\right) \) whose partial substitution is total.

The state space of our automaton is finite since the filled substitution \(S_\xi \) maps each filled variable to exactly one character in \(\varSigma _\lambda \). The automaton is nondeterministic, as the three choices we have for a transition are not necessarily mutually exclusive.

As an addition to the above definition, we introduce the notion of location as a pair of integers (i, j) corresponding to two positions inside the two words \(u_\xi \) and \(v_\xi \). A location (i, j) can also be seen as the set of states of the form ((i, j), S) for all possible partial substitutions S.

A run of the above nondeterministic automaton constructs a partial substitution for the given equation which is extended with each change of state. The equation has a solution if one of the accepting states \((|u_\xi |,|v_\xi |,S)\), where S is a total substitution, is reachable, because the automaton simulates a walk through our input equation left to right, with all its positions filled in a coherent way.

Example 1

Consider the equation \(u \,\dot{=}\,v\) for \(u =aZXb, v= aXaY \in \varXi ^*\). We choose the bounds \(b_X = b_Y = b_Z = 1\). This gives us the words \(u_\xi = aZ^{(0)}X^{(0)}b\) and \(v_\xi = aX^{(0)}aY^{(0)}\). Figure 1 visualizes the corresponding automaton. A run starting in the initial state and reaching one of the final states gives us a solution to the equation. In this example we get the substitutions \(Z \mapsto a, X \mapsto a, Y \mapsto b\) and \(Z \mapsto a, X \mapsto \varepsilon , Y \mapsto b\).

Fig. 1. Automaton for the word equation \(aZXb \,\dot{=}\,aXaY\), with the states grouped according to their locations. Only reachable states are shown.
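The reachability question behind this construction can be sketched as follows. This is a simplified illustration, not Woorpje's implementation: in particular it does not force λ to appear only at the end of a variable, which changes the shape of the witnessing substitution but not satisfiability.

```python
from collections import deque

LAM = "<lam>"                       # padding symbol λ for unused positions

def is_var(sym):
    return isinstance(sym, tuple)   # filled variables are (name, index) pairs

def reads(S, sym, c):
    """Can position `sym` be read as symbol c, extending the partial
    substitution S if necessary?  Constant letters never read λ."""
    if not is_var(sym):
        return sym == c
    if sym in S:
        return S[sym] == c
    S[sym] = c                      # extend the partial substitution
    return True

def reach_accepting(u_f, v_f, alphabet):
    """BFS over states ((i, j), S) of the equation automaton."""
    start = ((0, 0), ())
    seen, queue = {start}, deque([start])
    while queue:
        (i, j), items = queue.popleft()
        S = dict(items)
        if i == len(u_f) and j == len(v_f):
            return S                # accepting location reached
        succ = []
        if i < len(u_f) and j < len(v_f):       # diagonal: both sides read c
            for c in list(alphabet) + [LAM]:
                S2 = dict(S)
                if reads(S2, u_f[i], c) and reads(S2, v_f[j], c):
                    succ.append(((i + 1, j + 1), S2))
        if i < len(u_f) and is_var(u_f[i]):     # horizontal: u reads λ
            S2 = dict(S)
            if reads(S2, u_f[i], LAM):
                succ.append(((i + 1, j), S2))
        if j < len(v_f) and is_var(v_f[j]):     # vertical: v reads λ
            S2 = dict(S)
            if reads(S2, v_f[j], LAM):
                succ.append(((i, j + 1), S2))
        for loc, S2 in succ:
            state = (loc, tuple(sorted(S2.items())))
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return None

# Example 1: aZXb = aXaY with b_X = b_Y = b_Z = 1.
u_f = ["a", ("Z", 0), ("X", 0), "b"]
v_f = ["a", ("X", 0), "a", ("Y", 0)]
print(reach_accepting(u_f, v_f, "ab"))
```

As noted in the text, such an explicit exploration does not scale; the SAT encoding below replaces it.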

Theorem 1

Given a bounded word equation \(u \,\dot{=}\,v\) for \(u,v \in \varXi ^*\), with bounds B, the automaton \(A (u \,\dot{=}\,v,B)\) reaches an accepting state if and only if there exists S such that \(S {{\,\mathrm{\models }\,}}u \,\dot{=}\,v\) and \(|S(X)| \le B(X)\) for all \(X \in \varGamma \).

SAT Encoding. We now encode the solving process into propositional logic. For that we fix an ordering on the finite alphabets \(\varSigma _\lambda \) and \(\varGamma \). Using the upper bounds given for all variables \(X \in \varGamma \), we create the filled variables alphabet \(\varGamma _\xi \). Further, we create the Boolean variables \(\mathsf {K}^a_{X^{(i)}}\) for all \(X^{(i)} \in \varGamma _\xi \), \(a \in \varSigma _\lambda \) and \(i \in [b_X]\). Intuitively, we want to construct our formula such that an assignment \(\beta \) sets \(\mathsf {K}^a_{X^{(i)}}\) to 1 if the solution of the word equation corresponding to \(\beta \) places an a at position i of the variable X, meaning \(S_\xi \left( X^{(i)}\right) = a\). To make sure \(\mathsf {K}^a_{X^{(i)}}\) is set to 1 for exactly one \(a \in \varSigma _\lambda \), we define the clause \(\bigvee _{a \in \varSigma _\lambda } \mathsf {K}^a_{X^{(i)}}\), which needs to be assigned true, as well as the constraints \(\mathsf {K}^a_{X^{(i)}} \rightarrow \lnot \mathsf {K}^b_{X^{(i)}}\) for all \(a,b \in \varSigma _\lambda , X \in \varGamma , i \in [b_X]\) with \(a \ne b\), which also all need to be true.
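For instance, the exactly-one constraint for a single filled variable can be generated as follows, using DIMACS-style integer literals with a hypothetical variable numbering; the pairwise clauses are exactly the clausal form of the implications \(\mathsf {K}^a_{X^{(i)}} \rightarrow \lnot \mathsf {K}^b_{X^{(i)}}\).

```python
def exactly_one(lits):
    """Clauses stating that exactly one of the given literals is true:
    one at-least-one clause plus pairwise at-most-one clauses."""
    clauses = [list(lits)]                                            # at least one
    clauses += [[-p, -q] for i, p in enumerate(lits) for q in lits[i + 1:]]
    return clauses

# Hypothetical numbering K^a_{X^(0)} = 1, K^b_{X^(0)} = 2, K^λ_{X^(0)} = 3:
print(exactly_one([1, 2, 3]))
# [[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]]
```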

To match letters we add the variables \(\mathsf {C}_{a,a} \leftrightarrow \top \) and \(\mathsf {C}_{a,b} \leftrightarrow \bot \) for all \(a,b \in \varSigma _\lambda \) with \(a \ne b\). The actual encoding of our equation can then be defined as follows: for each \(w \in \{u_\xi ,v_\xi \}\), each position i of w and each letter \(a \in \varSigma _\lambda \), we introduce a variable \(\mathsf {word}_{w[i]}^a\) which is true if and only if w[i] corresponds to an a in the solution of the word equation. More precisely, we distinguish between constant letters and variable positions and define \(\mathsf {word}_{w[i]}^a \leftrightarrow \mathsf {C}_{w[i],a}\) if \(w[i] \in \varSigma _\lambda \) and \(\mathsf {word}_{w[i]}^a \leftrightarrow \mathsf {K}^a_{w[i]}\) if \(w[i] \in \varGamma _\xi \). The equality of two characters, corresponding to position i in u and, respectively, j in v, is encoded by introducing a Boolean variable \(\mathsf {wm}_{i,j} \leftrightarrow \bigvee _{a \in \varSigma _\lambda } \mathsf {word}_{u[i]}^a \wedge \mathsf {word}_{v[j]}^a\) for appropriate \(i \in [|u_\xi |]\), \(j \in [|v_\xi |]\).
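One way to turn the definition of \(\mathsf {wm}_{i,j}\) into clauses is a standard Tseitin translation with one auxiliary variable per letter; the auxiliary variables and the helper below are an implementation choice of this sketch, not something prescribed by the encoding above.

```python
def define_wm(wm, letter_pairs, fresh):
    """Clauses for wm <-> OR_a (word_u^a AND word_v^a).  `letter_pairs`
    lists, for every a in Σ_λ, the pair (word_u^a, word_v^a) of DIMACS
    variables; `fresh()` returns an unused variable number."""
    clauses, aux = [], []
    for p, q in letter_pairs:
        t = fresh()                          # t <-> (p AND q)
        clauses += [[-t, p], [-t, q], [t, -p, -q]]
        aux.append(t)
    clauses.append([-wm] + aux)              # wm -> (t_1 OR ... OR t_k)
    clauses += [[wm, -t] for t in aux]       # every t -> wm
    return clauses
```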

Based on this setup, each location of the automaton is assigned a Boolean variable. As seen in Definition 1, we process both sides of the equation simultaneously, from left to right. For a given equation \(u \,\dot{=}\,v\) we create \(n\cdot m=\left( |u_\xi |+1\right) \cdot \left( |v_\xi |+1\right) \) many Boolean variables \(\mathsf {S}_{i,j}\) for \(i \in [n]\) and \(j \in [m]\). Each variable corresponds to a location in our automaton. The location (0, 0) is our initial location and \((|u_\xi |,|v_\xi |)\) our accepting location. The goal is to find a path between those two locations, or, alternatively, a satisfying assignment \(\beta \) which sets the variables corresponding to the locations on such a path to 1. Every path between the location (0, 0) and another location corresponds to matching prefixes of u and v under proper substitutions. We call locations for which an assignment \(\beta \) sets the variable \(\mathsf {S}_{i,j}\) to 1 active locations. Our transitions are now defined by a set of constraints; we fix \(i \in [n]\) and \(j \in [m]\) in the following. The constraints are given as follows. The first constraint (1) ensures that every active location has at least one active successor. The next three constraints (2)–(4) ensure the validity of the paths we follow: from a location we can only proceed to exactly one other location, so we disallow simultaneous steps in multiple directions. In (5) and (6) we forbid using a \(\lambda \)-transition whenever there is another possibility of moving forward. We proceed in the same manner for two matching \(\lambda \) in (7); this part is especially important for finding substitutions which are smaller than the given bounds. The same idea applies to matching letters, whose encoding is given in (8). The actual transitions which are possible from one state to another are encoded in (9) by using our Boolean variables \(\mathsf {wm}_{i,j}\), which are true for matching positions in the two sides of the equation. This constraint allows us to move forward in both words if there was a match of two letters in the previous location. When the transitions are pictured as movements in the plane, this corresponds to a diagonal move. A horizontal or vertical move corresponds to a match with the empty word. The last constraint (10) ensures a valid predecessor. This is supposed to help the solver in deciding the satisfiability of the obtained formula, i.e., to guide the search in an efficient way. It can be seen as a local optimization step.

$$\begin{aligned}&\mathsf {S}_{i,j} \rightarrow \mathsf {S}_{i+1,j} \vee \mathsf {S}_{i,j+1} \vee \mathsf {S}_{i+1,j+1}\end{aligned}$$
(1)
$$\begin{aligned}&\left( \mathsf {S}_{i,j} \wedge \mathsf {S}_{i,j+1}\right) \rightarrow \left( \lnot \mathsf {S}_{i+1,j+1} \wedge \lnot \mathsf {S}_{i+1,j}\right) \end{aligned}$$
(2)
$$\begin{aligned}&\left( \mathsf {S}_{i,j} \wedge \mathsf {S}_{i+1,j}\right) \rightarrow \left( \lnot \mathsf {S}_{i+1,j+1} \wedge \lnot \mathsf {S}_{i,j+1}\right) \end{aligned}$$
(3)
$$\begin{aligned}&\left( \mathsf {S}_{i,j} \wedge \mathsf {S}_{i+1,j+1}\right) \rightarrow \left( \lnot \mathsf {S}_{i,j+1} \wedge \lnot \mathsf {S}_{i+1,j}\right) \end{aligned}$$
(4)
$$\begin{aligned}&\mathsf {S}_{i,j} \wedge \lnot \mathsf {word}_{u[i]}^\lambda \rightarrow \lnot \mathsf {S}_{i+1,j}\text { and }\mathsf {S}_{i,j} \wedge \mathsf {word}_{u[i]}^\lambda \wedge \lnot \mathsf {word}_{v[j]}^\lambda \rightarrow \mathsf {S}_{i+1,j}\end{aligned}$$
(5)
$$\begin{aligned}&\mathsf {S}_{i,j} \wedge \lnot \mathsf {word}_{v[j]}^\lambda \rightarrow \lnot \mathsf {S}_{i,j+1}\text { and }\mathsf {S}_{i,j} \wedge \lnot \mathsf {word}_{u[i]}^\lambda \wedge \mathsf {word}_{v[j]}^\lambda \rightarrow \mathsf {S}_{i,j+1}\end{aligned}$$
(6)
$$\begin{aligned}&\mathsf {S}_{i,j} \wedge \mathsf {word}_{u[i]}^\lambda \wedge \mathsf {word}_{v[j]}^\lambda \rightarrow \mathsf {S}_{i+1,j+1}\end{aligned}$$
(7)
$$\begin{aligned}&\mathsf {S}_{i,j} \wedge \mathsf {S}_{i+1,j+1} \rightarrow \mathsf {wm}_{i,j} \end{aligned}$$
(8)
$$\begin{aligned}&\mathsf {S}_{i,j} \leftrightarrow \left( \mathsf {S}_{i-1,j-1} \wedge \mathsf {wm}_{i-1,j-1} \right) \vee \left( \mathsf {S}_{i,j-1} \wedge \lnot \mathsf {wm}_{i,j-1} \right) \vee \left( \mathsf {S}_{i-1,j} \wedge \lnot \mathsf {wm}_{i-1,j} \right) \end{aligned}$$
(9)
$$\begin{aligned}&\mathsf {S}_{i+1,j+1} \rightarrow \mathsf {S}_{i,j} \vee \mathsf {S}_{i+1,j} \vee \mathsf {S}_{i,j+1} \end{aligned}$$
(10)

The final formula is the conjunction of all constraints defined above. This formula is true iff the location \((|u_\xi |,|v_\xi |)\) is reachable from the location (0, 0), which holds iff the given word equation is satisfiable w.r.t. the given length bounds.
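To illustrate how such constraints become clauses, the sketch below emits constraints (1) and (10) for every location of a toy grid; the remaining constraints (2)–(9) translate in the same manner. The variable numbering is hypothetical.

```python
def successor_predecessor_clauses(n, m, S):
    """Clauses for constraints (1) and (10).  S[i][j] is the DIMACS variable
    of location (i, j), for 0 <= i <= n and 0 <= j <= m, where (n, m) is the
    accepting location."""
    clauses = []
    for i in range(n):
        for j in range(m):
            # (1)  S_{i,j} -> S_{i+1,j} OR S_{i,j+1} OR S_{i+1,j+1}
            clauses.append([-S[i][j], S[i + 1][j], S[i][j + 1], S[i + 1][j + 1]])
            # (10) S_{i+1,j+1} -> S_{i,j} OR S_{i+1,j} OR S_{i,j+1}
            clauses.append([-S[i + 1][j + 1], S[i][j], S[i + 1][j], S[i][j + 1]])
    return clauses

# Toy 2x2 grid of locations with variables numbered row by row.
n, m = 1, 1
S = [[1 + i * (m + 1) + j for j in range(m + 1)] for i in range(n + 1)]
print(successor_predecessor_clauses(n, m, S))
```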

Lemma 1

Let \(u \,\dot{=}\,v\) be a word equation, let B be the function giving the bounds for the word equation variables, and let \(\varphi \) be the corresponding formula consisting of the conjunction of (1)–(10) and the constraints defined earlier in this section. Then \(\varphi \wedge \mathsf {S}_{0,0} \wedge \mathsf {S}_{|u_\xi |,|v_\xi |}\) has a satisfying assignment if and only if \(A(u \,\dot{=}\,v,B)\) reaches an accepting state.

Example 2

Consider the word equation \(u \,\dot{=}\,v\) where \(u = XaXbYbZ\) and \(v = aXYYbZZbaa \in \varXi ^*\), with \(\varSigma = \{a,b\}\) and \(\varGamma = \{X,Y,Z\}\). Using the approach discussed above, we find the solution \(S(X) = aaaaaaaa\), \(S(Y) = aaaa\) and \(S(Z) = aa\) using the bounds \(b_X = 8\) and \(b_Y = b_Z = 6\). We set up an automaton with \(32 \cdot 38 = 1216\) locations to solve the equation. In Fig. 2 we show the computation of the SAT solver. Light grey markers indicate locations considered in a run of the automaton. In this case only 261 locations are needed. The dark grey markers visualize the actual path in the automaton leading to the substitution. Non-diagonal stretches are \(\lambda \) transitions.

Fig. 2. Solver computation on \(XaXbYbZ \,\dot{=}\,aXYYbZZbaa\)

3.2 Refining Bounds and Guiding the Search

Initial experiments revealed a major inefficiency of our approach: most of the locations were not used during the search and only increased the encoding time. The many white markers in Fig. 2, indicating unused locations, visualize the problem. Since we create all required Boolean variables and constraints for every variable \(X \in \varGamma \) and every position \(i < b_X\), we can reduce the automaton size by lowering these upper bounds. Abstracting a word equation by the lengths of the variables gives us a way to refine the bounds \(b_X\) for some of the variables \(X \in \varGamma \). By only considering lengths we obtain a Diophantine equation in the following manner. We assume an ordering on the variable alphabet \(\varGamma = \{X_0, \dots , X_{n-1}\}\) and associate to each word equation variable \(X_j\) an integer variable \(I_j\).

Definition 2

For a word equation \(u \,\dot{=}\,v\) with \(\mathrm{var}(uv) = \{X_0, \dots , X_{n-1}\}\) we define its length abstraction by \(\sum _{j \in [n]} \left( |u|_{X_j} - |v|_{X_j}\right) \cdot I_j = \sum _{a \in \varSigma } \left( |v|_a - |u|_a\right) \).
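The length abstraction is easy to compute directly from the two sides of the equation; the following sketch (variables as single characters, for illustration only) reproduces the abstraction of Example 2.

```python
def length_abstraction(u, v, variables):
    """Length abstraction of u = v (Definition 2): the coefficient of each
    length variable I_X is |u|_X - |v|_X, and the right-hand side is the sum
    over the constant letters a of |v|_a - |u|_a."""
    coeff = {x: u.count(x) - v.count(x) for x in variables}
    rhs = sum(v.count(a) - u.count(a) for a in set(u + v) - set(variables))
    return coeff, rhs

# Example 2: XaXbYbZ = aXYYbZZbaa gives I_X - I_Y - I_Z = 2; the solution
# lengths |S(X)| = 8, |S(Y)| = 4, |S(Z)| = 2 indeed satisfy this equation.
print(length_abstraction("XaXbYbZ", "aXYYbZZbaa", {"X", "Y", "Z"}))
```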

If a word equation has a solution S, then so does its length abstraction with \(I_j = |S(X_j)|\) for each \(j \in [n]\). We are interested in computing upper bounds for each variable \(X_k \in \varGamma \) relative to the upper bounds of the bounded word equation problem. To this end we solve the length abstraction for \(I_k\) and bound the remaining variables by \(0 \le I_j \le b_{X_j}\); whenever the coefficient of \(I_k\) is non-zero, this yields a derived bound \(b^\mathsf {S}_{X_k}\). Whenever \(0< b^\mathsf {S}_{X_k} < b_{X_k}\) holds, we use \(b^\mathsf {S}_{X_k}\) instead of \(b_{X_k}\) to prune the search space.

The length abstraction is also useful because it might give information about the unsatisfiability of an equation: if there is no solution to the Diophantine equation, there is no solution to the word equation, and we directly report this fact. Unfortunately, whenever \(|u|_{X}-|v|_{X} = 0\) holds for a variable X, we cannot refine its bound, as it is not influenced by the above Diophantine equation.

Guiding the Search. The length abstraction used to refine upper bounds can also be used to guide the search in the automaton. In particular, it can restrict the allowed length of one variable based on the lengths of the others. We refer to the coefficient of variable \(I_j\) in Definition 2 by \(\mathsf {Co}_{u,v}(I_j) = \left( |u|_{X_j} - |v|_{X_j}\right) \).

To benefit from the abstraction of the word equation inside our propositional logic encoding we use Reduced Ordered Multi-valued Decision Diagrams (MDDs) [2]. An MDD is a directed acyclic graph with two nodes having no outgoing edges (called the true and false terminal nodes). A node in the MDD is associated to exactly one variable \(I_j\) and has an outgoing edge for each element of \(I_j\)’s domain. In the MDD, a node labelled \(I_j\) is only connected to nodes labelled \(I_{j+1}\). A row (\(\mathsf {r}(I_j)\)) in an MDD is the subset of nodes corresponding to a certain variable \(I_j\).

We create the MDD following Definition 2. The following definition creates the rows of the MDD recursively. An MDD node is a tuple consisting of a variable \(I_j\) and an integer corresponding to the partial sum which can be obtained using the coefficients and the possible lengths of all previous variables \(I_k\) for \(k < j\). We introduce a new variable \(I_{-1}\) labelling the initial node of the MDD. The computation is done as follows:

$$\begin{aligned} \mathsf {r}(I_{j}) = \left\{ \left( I_{j},\; s + k \cdot \mathsf {Co}_{u,v}(I_{j})\right) \mid (I_{j-1}, s) \in \mathsf {r}(I_{j-1}),\; k \in [b_{X_{j}}]_0 \right\} \end{aligned}$$
(11)

and \(\mathsf {r}(I_{-1}) = \{(I_{-1}, 0)\}\). Since \(I_j\) is associated to the word equation variable \(X_j\), we let \(\mathsf {r}(X_{j}) = \mathsf {r}(I_{j})\). We denote the whole set of nodes in the MDD by \(M^C\). The \(\mathtt {true}\) node of the MDD is \((I_{n-1}, \mathsf {s_\#})\), where \( \mathsf {s_\#} = \sum _{a \in \varSigma } \left( |v|_a - |u|_a\right) \). If the initial creation of nodes did not add this node, the given equation (Definition 2) is not satisfiable; hence the word equation has no solution given the set bounds. Furthermore, there is no need to encode the full MDD when only a subset of its nodes can reach \((I_{n-1},\mathsf {s_\#} )\). For reducing the MDD nodes to this subset, we calculate all predecessors of a given node \((I_i,s) \in M^C\) as follows

$$\begin{aligned} \mathsf {pred}\left( (I_i,s)\right) = \left\{ (I_{i-1}, s') \in M^C \mid s = s' + k \cdot \mathsf {Co}_{u,v}(I_i) \text { for some } k \in [b_{X_i}]_0 \right\} . \end{aligned}$$
The minimized set \(M = F(T)\) of nodes from which the accepting node is reachable, where \(T = \{(I_{n-1}, \mathsf {s_\#})\}\), is afterwards defined through a fixed point by

$$\begin{aligned} T \subseteq F(T) \wedge \left( \forall \; p \in F(T) : q \in \mathsf {pred}(p) \wedge q \in M^C \Rightarrow q \in F(T)\right) \end{aligned}$$
(12)

We continue by encoding this into a Boolean formula. For that we need information on the actual length of a possible substitution. We reuse the Boolean variables of our filled variables \(X \in \varGamma _\xi \). The idea is to introduce \(b_{X_i}+1\) many Boolean variables \(\mathsf {OH}_i(0), \dots , \mathsf {OH}_i(b_{X_i})\) for each \(X_i \in \varGamma \), where \(\mathsf {OH}_i(j)\) is true if and only if \(X_i\) has length j in the actual substitution. This kind of encoding is known as a one-hot encoding. To achieve this we add a constraint forcing substitutions to have all \(\lambda \) at the end: we add the clauses \(\mathsf {K}^{\lambda }_{X_i^{(j)}} \rightarrow \mathsf {K}^{\lambda }_{X_i^{(j+1)}}\) for all \(j \in [b_{X_i}-1]\) and \(X_i^{(j)} \in \varGamma _\xi \). The actual encoding is done by adding the constraints \(\mathsf {OH}_i(0) \leftrightarrow \mathsf {K}^{\lambda }_{X_i^{(0)}}\) and \(\mathsf {OH}_i(b_{X_i}) \leftrightarrow \lnot \mathsf {K}^{\lambda }_{X_i^{(b_{X_i}-1)}}\), which fix the edge cases of the empty substitution and of a substitution containing no \(\lambda \) at all. For all j with \(0< j < b_{X_i}\), we add the constraints \(\mathsf {OH}_i(j)\leftrightarrow \mathsf {K}^{\lambda }_{X_i^{(j)}} \wedge \lnot \mathsf {K}^{\lambda }_{X_i^{(j-1)}}\), which mark the first occurrence of \(\lambda \). The encoding of the MDD is done nodewise by associating a Boolean variable \(\mathsf {M}_{i,j}\) with each \(i \in [|\varGamma |]\) and each j such that \((I_i,j) \in M\). Our goal is now to find a path inside the MDD from node \((I_{-1},0)\) to \((I_{n-1},\mathsf {s_\#})\). Therefore we enforce a true assignment for the corresponding variables \(\mathsf {M}_{-1,0}\) and \(\mathsf {M}_{{n-1},\mathsf {s_\#}}\). A valid path is encoded by the constraint \(\mathsf {M}_{i-1,j} \wedge \mathsf {OH}_i(k) \rightarrow \mathsf {M}_{i,s}\) for each variable \(X_i \in \varGamma \) and \(k \in [b_{X_i}]_0\), where \(s = j + k \cdot \mathsf {Co}_{u,v}(X_i)\) and \((I_i,s) \in M\). This encodes the fact that whenever we are at a node \((I_{i-1},s) \in M\) and the substitution for a variable \(X_i\) has length k (\(|S(X_i)| = k\)), we move on to the next node, which corresponds to \(X_{i}\) and the integer obtained by taking the coefficient of the variable \(X_i\), multiplying it by the substitution length, and adding it to the previous partial sum s. Whenever a node \((I_{i-1},j)\) has only one outgoing edge within our MDD, say the one corresponding to length k, we directly force the corresponding one-hot variable to be true by adding \(\mathsf {M}_{i-1,j} \rightarrow \mathsf {OH}_i(k)\). This reduces the number of guesses on variables.

Fig. 3. The MDD for \(aX_2X_0b \doteq aX_0aX_1 \)

Example 3

Consider the equation \(u \,\dot{=}\,v\) for \(u = aX_2X_0b, v= aX_0aX_1 \in \varXi ^*\), where \(\varSigma = \{a,b\}\) and \(\varGamma = \{X_0,X_1,X_2\}\). The corresponding linear equation therefore has the form \(0\cdot I_0 + (-1) \cdot I_1 + 1 \cdot I_2 = 0\), which gives us the coefficients \(\mathsf {Co}_{u,v}(X_0) = 0\), \(\mathsf {Co}_{u,v}(X_1) = -1\) and \(\mathsf {Co}_{u,v}(X_2) = 1\). For the given bounds \(b_{X_0} = b_{X_1} = b_{X_2} = 2\) the induced MDD has the form shown in Fig. 3. In this example \(s_\# = 0\), and therefore \((I_2,0)\) is the only node connected to the true node. The minimization of the MDD by using the fixed point described in (12) removes all grey nodes, since they cannot reach the true node. The solver returns the substitution \(S(X_0) = \varepsilon \), \(S(X_1) = b\) and \(S(X_2) = a\). It took the centred path consisting of the nodes \((I_{-1},0),(I_0,0),(I_1,-1),(I_2,0),\texttt {true}\) inside the MDD.
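The row construction of (11) and the pruning of (12) can be sketched as follows, keeping only the partial sums per row instead of explicit MDD nodes; this is an illustration of the idea rather than Woorpje's implementation.

```python
def build_mdd(coeffs, bounds, s_target):
    """Rows of partial sums: a node (row j, sum s) is reachable iff some
    choice of lengths 0..b_{X_k} for the variables X_0..X_j adds up to s.
    Afterwards, keep only nodes from which s_target is reachable."""
    order = sorted(coeffs)                       # fixed variable ordering
    rows = [{0}]                                 # row of the initial node I_{-1}
    for x in order:
        rows.append({s + k * coeffs[x]
                     for s in rows[-1] for k in range(bounds[x] + 1)})
    if s_target not in rows[-1]:
        return None                              # length abstraction unsatisfiable
    keep = [set() for _ in rows]
    keep[-1] = {s_target}                        # the single accepting node
    for lvl in range(len(order) - 1, -1, -1):    # backward pruning
        x = order[lvl]
        keep[lvl] = {s for s in rows[lvl]
                     if any(s + k * coeffs[x] in keep[lvl + 1]
                            for k in range(bounds[x] + 1))}
    return keep

# Example 3: coefficients 0, -1, 1, bounds 2 and s_# = 0; after pruning the
# surviving partial sums per row are I_-1: {0}, I_0: {0}, I_1: {0,-1,-2}, I_2: {0}.
print(build_mdd({"X0": 0, "X1": -1, "X2": 1}, {"X0": 2, "X1": 2, "X2": 2}, 0))
```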

Adding Linear Length Constraints. Until now we have only concerned ourselves with bounded word equations. As mentioned in the introduction, however, word equations with linear constraints are of interest as well. In particular, without loss of generality we restrict ourselves to linear constraints of the form [2] \(c_0 I_0 + \cdots + c_{n-1} I_{n-1} \le c\), where \(c, c_i \in {\mathrm{\mathbb {Z}}}\) are integer coefficients and the \(I_i\) are integer variables with a finite domain \([d_i]_0\) for a corresponding \(d_i\in {\mathrm{I\!N}}\). Each \(I_i\) corresponds to the length of a substitution to a variable of the given word equation.

Notice that the structure of the linear length constraint is similar to that of Definition 2. For handling linear constraints we can adapt the generation of MDD nodes to keep track of the partial sum of the linear constraint, and define the accepting nodes of the MDD to be those where all rows have been visited and the inequality holds. We simply extend the set T which was used in the fixed point iteration in (12) to the set \(T = \{(I_{n-1}, s) \in M^C \mid s \le c\}\).
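Under the same simplified sketch as above, adapting the accepting condition for an inequality is a one-line change: instead of the single node with partial sum \(\mathsf {s_\#}\), every final-row node whose partial sum satisfies the inequality is accepting.

```python
def accepting_nodes(last_row, c, equation=False):
    """Accepting nodes in the last MDD row: the single node with partial sum
    c for an equation, or every node with partial sum <= c for a linear
    constraint c_0*I_0 + ... + c_{n-1}*I_{n-1} <= c (sketch only)."""
    return {s for s in last_row if (s == c if equation else s <= c)}

print(accepting_nodes({-2, -1, 0, 1, 2}, 0))          # {-2, -1, 0}
print(accepting_nodes({-2, -1, 0, 1, 2}, 0, True))    # {0}
```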

4 Experiments

The approach described in the previous sections has been implemented in the tool Woorpje. The inner workings of Woorpje are visualised in Fig. 4. Woorpje first applies a preprocessing step in which obviously satisfiable or unsatisfiable word equations are reported immediately.

Fig. 4. Architecture of Woorpje

After the preprocessing step, Woorpje iteratively encodes the word equation into a propositional logic formula and solves it with Glucose [3] for increasing maximal variable lengths (\(i^2\), where i is the current iteration). If a solution is found, it is reported. The maximal value of i is user definable, and by default set to \(2^n\) where n is the length of the given equation. If Woorpje reaches the given bound without a verdict, it returns unknown.
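The outer loop just described can be summarised as follows; solve_bounded is a placeholder for the whole encode-and-solve step with a given per-variable bound, not an actual Woorpje interface.

```python
def iterative_solve(solve_bounded, equation_length, max_i=None):
    """Sketch of the outer loop: re-encode and solve with per-variable bound
    i^2 for i = 1, 2, ...  `solve_bounded(bound)` is assumed to return a
    substitution, or None if no solution exists within the bound.
    (Unsatisfiability verdicts from the preprocessing step are omitted.)"""
    if max_i is None:
        max_i = 2 ** equation_length            # default limit from the text
    for i in range(1, max_i + 1):
        model = solve_bounded(i * i)            # bound every variable by i^2
        if model is not None:
            return model
    return "unknown"
```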

We have run Woorpje and state-of-the-art word equation solvers (CVC4 1.6, Norn 1.0.1, Sloth 1.0, Z3 4.8.4) on several word equation benchmarks with linear length constraints. The benchmarks range from theoretically interesting cases to variations of the real-world application set Kaluza [19]. All tests were performed on Ubuntu Linux 18.04 with an Intel Xeon E5-2698 v4 @ 2.20 GHz CPU and 512 GB of memory with a timeout of 30 s.

We used five different kinds of benchmarks. The first track (I) was produced by generating random strings and replacing factors with variables at random, in a coherent fashion. This guarantees the existence of a solution. The generated word equations have at most 15 variables, 10 letters, and length 300. The second track (II) is based on the idea in Proposition 1 of [10], where the equation \(X_naX_nbX_{n-1}bX_{n-2}\cdots bX_1\doteq aX_nX_{n-1}X_{n-1}bX_{n-2}X_{n-2}b \cdots b X_{1}X_{1}baa\) is shown to have a minimal solution of exponential length w.r.t. the length of the equation. The third track (III) is based on the second track, but each letter b is replaced by the left-hand side or the right-hand side of some randomly generated word equation (e.g., with the methods from track (I)). In the fourth track (IV) each benchmark consists of a system of 100 small random word equations with at most 6 letters, 10 variables and length 60. The hard aspect of this track is solving multiple equations at the same time. Within the fifth track (V) each benchmark enriches a system of 30 word equations by suitable linear constraints, as presented in this paper. This track is inspired by the Kaluza benchmark set in terms of having many small equations enriched by linear length constraints. All tracks, except track II which holds 9 instances, consist of 200 benchmarks. The full benchmark set is available at https://www.informatik.uni-kiel.de/~mku/woorpje. Table 1 is read as follows: for each solver and track we report the number of correctly classified instances, the number of incorrectly classified instances, the number of instances for which the solver returned no answer but terminated before the timeout of 30 s was reached (unknown), the number of instances for which the solver was killed after 30 s (timeout), and the overall solving time. The produced substitutions were checked for correctness afterwards. The classification of answers as correct or incorrect was done by ad-hoc case inspection whenever not all solvers agreed on a result. In the cases where one solver produced a valid solution and others did not, we validated the substitutions manually. For the cases where one solver determined an equation to be unsatisfiable and all others timed out, we treated the unsat result as correct. This means that we only report errors if a solver reports unsat and we know the equation is satisfiable. During our evaluation of track I, CVC4 crashed with a null-pointer exception on the word equation \(dbebgddbecfcbbAadeeaecAgebegeecafegebdbagddaadbddcaeeebfabfefabfacdgAgaabgegagf \,\dot{=}\,dbebgddbeAfcbbAaIegeeAaDegagf\), where lowercase symbols are letters and uppercase symbols are variables. Worth mentioning is that the tool Sloth reported 14 benchmarks as satisfiable without being able to produce a valid model, while at least two other tools classified them as unsatisfiable. We treated this as erroneous behaviour.

Table 1. Benchmark results per solver and track (correctly classified, reported unknown, timed out after 30 s, incorrectly classified, and total time in seconds)

The results show that Woorpje produces reliable results (0 errors) in competitive time. It outperforms the competitors on tracks I, III and IV and stays relatively close to the leaders Z3str3, Z3Seq and CVC4 on track V. On track II Woorpje trails CVC4 and Z3str3. The major inefficiency of Woorpje is related to multiple equations with large alphabets and linear length constraints.

It is worth emphasising that the benchmarks developed here seem to be of intrinsic interest, as they challenge even established solvers.

5 Conclusion

In this paper we presented a method for solving word equations by using a SAT solver. The method is implemented in our new tool Woorpje, and experiments show that it is competitive with state-of-the-art string solvers. Woorpje solves word equation instances that other solvers fail to solve. This indicates that our technique can complement existing techniques in a portfolio approach.

In the future, we aim to extend our approach to include regular constraints. As our approach relies on automata theory, we expect this to be achievable. Another step is the enrichment of our linear constraint solving. We currently perform a basic analysis using the MDDs. There are a few refinement steps described in [2] which seem applicable. A further major step is to develop a more efficient encoding of the alphabet of constants. Currently the state space explodes due to the massive branching caused by the use of large alphabets.