1 Introduction

Constraint learning is the task of acquiring constraint satisfaction or optimization problems from examples of solutions and non-solutions or other types of supervision. This document overviews selected topics in constraint learning and has no ambition of completeness. More specifically, we will discuss learning of hard constraints, which implicitly define a satisfaction problem; learning of soft constraints, where competing constraints are assigned different preferences; and interactive learning of hard or soft constraints, where the learning algorithm obtains supervision by interacting with an oracle (e.g. a human expert, a non-expert, a measurement apparatus). For a more in-depth guide to constraint learning and related areas, we refer the interested reader to the works by O’Sullivan [35], Bessiere et al. [9], Lombardi et al. [30], and De Raedt et al. [15]. Further pointers to the literature are provided in Sect. 6.

Why Constraint Learning?

Constraints are extremely popular from a modeling perspective, and appear in all sorts of satisfaction and optimization problems in both artificial intelligence and operations research. Plenty of frameworks for modeling and solving problems involving constraints exist, from propositional satisfiability (SAT) and linear programming (LP), to constraint satisfaction [42], to more sophisticated alternatives that combine combinatorial and numerical elements, like mixed-integer linear programming (MILP). Given a formal specification of a constraint satisfaction or optimization problem of interest, it is often sufficient to feed it to an appropriate solver to obtain a solution.

A major bottleneck of this setup is that obtaining a formal constraint theory is non-obvious: designing an appropriate, working constraint satisfaction or optimization problem requires both domain and modeling expertise. For this reason, in many cases a modeling expert is hired and has to interact with a domain expert to acquire informal requirements and turn them into a valid constraint theory. This process can be expensive and time-consuming.

The idea is then to replace or assist this process by acquiring a constraint theory directly from examples. In the simplest setting, when learning hard constraints—which implicitly define constraint satisfaction models like propositional concepts [54], satisfiability modulo theory formulas [5], and general constraint theories over discrete variables [9]—one is given a data set of positive and negative configurations and searches for a theory that covers (i.e. classifies as feasible) all of the positive examples and none of the negative ones. At its core, this is a form of concept learning [9]. Several variants and extensions of this setup have been proposed, including learning of soft constraints [41], which can be violated and are assigned different degrees of importance, and full-fledged optimization problems [36]. In the following, we will briefly cover the most prominent of these settings.

Of course, in practice the learned constraint theory may be approximate or wrong. In this case, two things can occur: either more data is provided and the theory is refined accordingly, or a human expert can revise the model via inspection and debugging. Here the goal is not to replace human experts, but rather to aid them by generating a reasonable initial theory consistent with all available supervision. Notice that in some settings the model does not have to be perfect. For instance, in recommendation the goal is to acquire a soft constraint theory that knows enough about the preferences of the target user to be able to provide reasonable personalized recommendations: so long as the model manages to identify some interesting products, the goal is met [39]. In this case, approximate models are acceptable and no human intervention is needed.

Dimensions of Constraint Learning

Formalisms and approaches to constraint learning can be roughly grouped based on three criteria:

  (a) The type of constraints being learned. One can learn hard constraints, which define pure satisfaction models, soft constraints, which implicitly define optimization models, or (in principle) both. We will consider approaches to learning hard and soft constraints in Sects. 3 and 4, respectively. There is essentially no literature on learning both hard and soft constraints, so we will skip this topic.

  (b) The technique used to represent and search over the set of candidate constraints or constraint theories. We will consider both search-based and syntax-guided synthesis approaches for learning hard constraints in Sect. 3, as well as optimization-based approaches in Sects. 4 and 5.

  (c) Whether learning occurs from a pre-existing data set (passive learning) or by interactively asking questions to an oracle (interactive learning). Section 5 is dedicated to this last setting.

In the next section, we introduce a simple, general formalism for expressing hard and soft constraint theories.

2 Constraint Theories

There are a number of successful formalisms for modeling and solving constraint programming problems [42]. In this overview we will restrict ourselves to a very general but minimal notation, to avoid as much overhead as possible.

Let us start by establishing some basic notation. In the following, variables will be written in upper-case X, constants in lower-case x, and value assignments \(X = x\). For simplicity, all variables will take values in the same domain \(\mathcal {X}\subset \mathbb {Z}\) or \(\mathcal {X}= \mathbb {R}\), unless otherwise specified. Vectors of variables will be written in upper-case bold \(\varvec{X}= (X_1, \ldots , X_n)\) and total value assignments \(X_1 = x_1 \wedge \ldots \wedge X_n = x_n\) simply as \(\varvec{X}= \varvec{x}\). Total assignments are also called configurations. The indicator function \(\mathbbm {1}\left\{ \varphi \right\} \) evaluates to 1 if condition \(\varphi \) holds and to 0 otherwise. Throughout the paper, we will use the terms “model”, “constraint theory”, and “constraint satisfaction/optimization problem” interchangeably.

In our simple framework, a constraint theory (or constraint network [9]) is defined by a set of variables \(\varvec{X}\) and by a set of constraints \(c_1, \ldots , c_m\). A constraint with n arguments \(c_j(X_1, \ldots , X_n)\) distinguishes between feasible and infeasible configurations. For instance, the constraint \(c_j(X_1, X_2) = X_1 \vee X_2\), where \(X_1\) and \(X_2\) are Boolean, specifies that the configurations \((\texttt {true}, \texttt {true})\), \((\texttt {true}, \texttt {false})\), and \((\texttt {false}, \texttt {true})\) are feasible, while \((\texttt {false}, \texttt {false})\) is not. A constraint with n arguments can be more formally defined as an n-ary relation [9], but we will leave such technicalities aside. Further, constraints over continuous variables are also possible, e.g., consider the linear constraint \(2 X_1 - 3 X_2 \le 15\) or the mixed non-linear constraint \(\lnot \text {Slim} \implies (H < 110) \vee (\pi \cdot R^2 \ge 1000)\), where \(X_1, X_2, H, R \in \mathbb {R}\) and \(\text {Slim}\) is Boolean. The set of constraints \(c_1, \ldots , c_m\) is usually implicitly conjoined, but keep in mind that this does not hold for soft constraint theories; see Sect. 4 for some counter-examples.

In this overview, we will consider different kinds of constraint theories over both discrete and continuous variables, including:

  • Propositional logic theories (or formulas) consist of Boolean variables, \(\mathcal {X}= \{\texttt {true}, \texttt {false}\}\), combined with the usual logical connectives: conjunction \(\wedge \), disjunction \(\vee \), and negation \(\lnot \). Propositional formulas are most often encountered in conjunctive or disjunctive normal form. An example may be:

    $$ (\text {Saturday} \vee \text {Sunday}) \wedge \text {Sunny} \wedge \lnot \text {Bored} \wedge \lnot \text {Sick} $$

    which is a possible definition for the concept of “fun weekend”. The main inference task is satisfiability (SAT) [21], where the goal is to find a total value assignment \(\varvec{X}= \varvec{x}\) that renders the theory true. Extensions include MAX-SAT and weighted MAX-SAT, which will be discussed later.

  • Constraint networks [9] generalize propositional logic to arbitrary discrete variables and arbitrary constraints, for instance relations such as \(\le \), \(=\), \(\ne \), ... and global constraints like all-different. Inference is once again a form of satisfaction.

  • Satisfiability modulo theories (SMT for short) extend propositional logic with one or more decidable theories \(\mathcal {T}\) [5]. For instance, satisfiability modulo linear real arithmetic (\(\text {SMT}(\mathcal {L}\mathcal {R}\mathcal {A})\)) combines logic with linear arithmetic over the reals, and therefore introduces continuous variables and arbitrary linear constraints over them. An example \(\text {SMT}(\mathcal {L}\mathcal {R}\mathcal {A})\) formula for describing a happy outdoor weekend is:

    $$ (\text {Saturday} \vee \text {Sunday}) \wedge (\text {Rain} + \text {SoilHumidity} \le 2) $$

    Other decidable theories include linear integer arithmetic, bit-vectors, and uninterpreted functions, but we will stick to \(\mathcal {L}\mathcal {R}\mathcal {A}\) for simplicity. In this case inference is also a form of satisfaction.

  • Linear programs (LP) include both a linear objective function of real variables and a set of implicitly conjoined linear constraints [52]. An LP in canonical form is written as:

    $$\begin{aligned} \max _{\varvec{x}}&\; \varvec{f}^\top \varvec{x}\\ \text {s.t.}&\; \varvec{a}_j^\top \varvec{x}\le b_j&\forall j = 1, \ldots , m \end{aligned}$$

    Here \(\varvec{f}\) is a constant vector that defines the (gradient of the) objective function, while \(\varvec{a}_1, \ldots , \varvec{a}_m\) and \(b_1, \ldots , b_m\) specify m linear constraints. Mixed-integer linear programs (MILP) have the same form, but allow both continuous and integer variables. Notice that, for both LPs and MILPs, inference is a form of optimization rather than satisfaction, that is, the model specifies not only a feasible space (like in standard satisfaction) but also a score over alternative feasible configurations. A small worked LP example is sketched right after this list.
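Below is a minimal sketch of solving a tiny LP in the canonical form above with an off-the-shelf solver; the numbers are made up for illustration, and we assume NumPy and SciPy are available (SciPy's linprog minimizes, so the objective is negated).

```python
# Minimal LP in canonical form: max f^T x  s.t.  a_j^T x <= b_j,  x >= 0.
# The data below is purely illustrative.
import numpy as np
from scipy.optimize import linprog

f = np.array([3.0, 2.0])             # objective coefficients (to be maximized)
A = np.array([[1.0, 1.0],            # a_1^T x <= 4
              [2.0, 1.0]])           # a_2^T x <= 6
b = np.array([4.0, 6.0])

# linprog minimizes, so we pass -f and recover the maximization value afterwards.
res = linprog(c=-f, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print("optimal x:", res.x, "objective value:", f @ res.x)
```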

Of course there are many other kinds of constraint theories, e.g., in database systems and spreadsheet software. In this case, variables can be strings or other objects. However, we will not consider these further, and refer the interested reader to [17, 26] instead.

3 Learning Hard Constraints

Warmup: Learning \(\varvec{k}\)-CNF Theories

Let us start from the simplest constraint learning problem: learning a k-CNF formula. Such formulas are the conjunction of clauses (disjunctions) with at most k literals each, where a literal is either a variable or its negation. For instance, happy weekends are captured by the 2-CNF formula \((\text {Saturday} \vee \text {Sunday}) \wedge \lnot \text {Rainy} \wedge \lnot \text {ImminentDeadline}\).

Now, let there be a hidden k-CNF theory \(\varphi ^*\). Given a set of example configurations labeled based on whether they are feasible with respect to \(\varphi ^*\) (positive) or not (negative), we want to recover \(\varphi ^*\) from the data alone. Valiant’s algorithm is a classic strategy to achieve this goal. The idea is simple. First, build the set of all candidate clauses of length at most k over the variables \(\varvec{X}\). Then, taking each positive example in turn, remove from the set of candidates all of the clauses that are inconsistent with the example. This makes sure that, upon scanning over all positive examples, the set of candidates only contains clauses consistent with the data set. Upon termination, the learned formula is retrieved by taking the conjunction of all surviving clauses.
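A minimal sketch of this procedure, under the (illustrative) assumption that configurations are tuples of truth values and clauses are sets of (variable index, polarity) literals:

```python
# Sketch of Valiant's algorithm: keep only the clauses of length <= k that are
# consistent with every positive example; the learned theory is their conjunction.
from itertools import combinations

def all_clauses(n_vars, k):
    """All clauses with at most k literals over n_vars Boolean variables."""
    literals = [(i, pol) for i in range(n_vars) for pol in (True, False)]
    return {frozenset(lits)
            for size in range(1, k + 1)
            for lits in combinations(literals, size)}

def satisfies(clause, x):
    """A clause holds if at least one of its literals holds under x."""
    return any(x[i] == pol for i, pol in clause)

def valiant(positives, n_vars, k):
    candidates = all_clauses(n_vars, k)
    for x in positives:  # discard clauses violated by some positive example
        candidates = {c for c in candidates if satisfies(c, x)}
    return candidates

# Toy usage: positives consistent with (X1 or X2) and (not X3).
positives = [(True, False, False), (True, True, False), (False, True, False)]
learned = valiant(positives, n_vars=3, k=2)
```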

Despite its simplicity, Valiant’s algorithm is PAC (probabilistically approximately correct) algorithm, meaning that the probability over all random data sets that the algorithm works is arbitrarily high so long as there are enough examples [54]. However, in order to work, Valiant’s algorithm requires two key assumptionsFootnote 2: (a) the data set must be noiseless, i.e., there must be no measurement errors on the variables and no corruption on the labels, and (b) the hidden concept \(\varphi ^*\) must be a k-CNF formula, which is not always known in advance. This is the so-called realizable setting. If these assumptions are not satisfied, then the algorithm gives unreliable results.

Encoding and Searching the Space of Candidates

Let us focus on discrete variables only for the time being. During learning, it is convenient to sort the space of candidate theories according to the generality relation \(\succeq _g\). A constraint c is said to be more general than a constraint \(c'\), written \(c \succeq _g c'\), if and only if the feasible set of c contains the feasible set of \(c'\), that is:

$$\begin{aligned} c \succeq _g c' \quad \Longleftrightarrow \quad \forall \varvec{x}\,.\, (\varvec{x}\,\models \,c') \implies (\varvec{x}\,\models \,c) \end{aligned}$$

If c is more general than \(c'\), then \(c'\) is more specific than c, and vice versa. The generality relation induces a lattice over both constraints and constraint theories, which is perhaps the most common way to structure the space of candidates.
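For small finite domains, the generality relation can be checked directly by enumeration. A minimal sketch, assuming (as a representation choice of ours) that constraints are Boolean predicates over configurations:

```python
# c is more general than c' iff every configuration feasible for c' is feasible for c.
from itertools import product

def more_general(c, c_prime, n_vars):
    return all(c(x)
               for x in product([True, False], repeat=n_vars)
               if c_prime(x))

# Example: (X1 or X2) is more general than (X1 and X2).
c       = lambda x: x[0] or x[1]
c_prime = lambda x: x[0] and x[1]
print(more_general(c, c_prime, n_vars=2))  # True
```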

Once the set of candidates is given, the question becomes how to efficiently find a candidate theory compatible with the observed examples. In this sense, Valiant’s algorithm can be viewed as a generate-and-test algorithm: it enumerates all candidates and then discards all of the ones incompatible with the data. In doing so, it keeps track of the most specific candidate theory. Beyond Valiant’s algorithm, there exist notable examples of generate-and-test learners, led by the impressive ModelSeeker [7] approach to acquiring global constraints. But let us briefly consider alternative search strategies too.

There are three classic approaches to searching the space of candidates, all based on the above lattice structure:

  • General-to-specific (aka top-down) approaches start from the most general concept (namely, true) and gradually specialize it according to the examples by introducing extra constraints that exclude the observed negatives from the feasible space of the learned theory.

  • Specific-to-general (aka bottom-up) approaches, unsurprisingly, do the converse: they start from a most specific theory or set of theories and incrementally generalize them. It is often the case that the most specific theory is simply the disjunction of all positive examples—which, unless the data is inconsistent, excludes all negative examples. Generalization then boils down to removing constraints or removing conditions from constraints so as to enlarge the feasible space of the candidate theory, while keeping all negatives outside of it.

  • Version space approaches keep track of the whole set of candidates at once. The version space (VS) is indeed defined as the set of all theories that are consistent with the data set [33], and it is the sub-lattice (induced by the generality relation \(\succeq _g\)) between a set of most-general (top) and most-specific (bottom) candidates. In a discrete setting, VS learners leverage incremental bi-directional search, whereby an initial estimate of the VS is incrementally refined by iterating over all examples and, for each example, checking whether the most general theory wrongly covers it (if it is negative) or whether the most specific candidate wrongly excludes it (if it is positive). In either case the VS is updated. We note in passing that version spaces are not restricted to discrete variables, and that they are used in recent algorithms for both active learning [20] and preference elicitation [12].

In more general terms, learning of hard constraints can be viewed simply as a search problem, and therefore any search algorithm can in principle be used, including stochastic local search, genetic algorithms, etc.

It is worth remarking that, in most cases of interest, the set of candidates is exponential in the number of variables. For instance, for k-CNF, given v variables one can build 2v literals, and thus on the order of \((2v)^k\) potential clauses out of them. Learning approaches use smart strategies to avoid enumerating such a humongous set. Perhaps the simplest solution is to leverage the generality relation, by automatically excluding all constraints that are more general than an already excluded one. Alternative approaches include restricting the space of hypotheses by introducing background knowledge, e.g., by restricting the value of k or the initial set of candidates. Providing constraint templates to be filled in, as in sketching [48], is also an option. A sensible alternative is to compactly represent the space of candidates using a satisfaction or optimization problem. This strategy is at the core of syntax-guided synthesis (SyGuS) [1, 2], a general framework for designing programs from specifications and examples. Two notable examples of SyGuS are the celebrated constraint acquisition system Conacq [8], which encodes the version space using a propositional formula, and Incal [27], an approach for learning \(\text {SMT}(\mathcal {L}\mathcal {R}\mathcal {A})\) theories from examples that extends SyGuS to continuous variables.

4 Learning Soft Constraints

Soft constraints are a powerful tool for dealing with conflicting requirements, uncertain inputs, and imperfect specifications. Intuitively, soft constraints introduce two new rules: first, it is not mandatory to satisfy soft constraints, and second, some soft constraints are more important than others [10, 42]. Therefore, if two soft constraints are incompatible, the more important one should be satisfied.

Soft Constraints with Linear Preferences

In their most general form, preferences over soft constraints can be encoded as a binary relation. Although very flexible, in the worst case encoding a relation over m soft constraints requires \(m^2\) parameters. This is very cumbersome both when manually designing the constraint theory and when learning it from examples.

For this reason, we will focus on a more agile alternative, where the absolute importance of a constraint is determined by an objective function associated to it. The overall quality of a configuration is then given by the sum of the importances of the soft constraints that it satisfies. More formally, a theory in this form includes:

  • m soft constraints \(s_1, \ldots , s_m\), each associated to an objective function \(w_j: \mathcal {X}\rightarrow \mathbb {R}\), \(j = 1, \ldots , m\), and

  • k hard constraints \(h_1, \ldots , h_k\) that must be satisfied.

Finding the most preferable (aka optimal or highest scoring) configuration \(\varvec{x}^*\) is accomplished by maximizing the total weight \(f_{\varvec{w}}(\varvec{x})\) of the soft constraints it satisfies, as follows:

$$\begin{aligned} \varvec{x}^* = \mathop {\text {argmax}}\limits _{\varvec{x}}&\; f_{\varvec{w}}(\varvec{x}) := \sum _{j=1}^m w_j(\varvec{x}) \mathbbm {1}\left\{ \varvec{x}\,\models \,s_j\right\} \end{aligned}$$
(1)
$$\begin{aligned} \text {s.t.}&\; \varvec{x}\,\models \,h_j&\forall j = 1, \ldots , k \end{aligned}$$
(2)

The computational complexity of this inference problem depends on the type of constraints and objective functions appearing above, but in most cases of interest (like MAX-SAT below) it is NP-hard or beyond.
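To make Eqs. 1–2 concrete, here is a brute-force sketch of the inference problem for Boolean variables; constraints are encoded as Python predicates of our own choosing, and real solvers of course avoid this exhaustive enumeration:

```python
# Brute-force inference for a soft constraint theory (cf. Eqs. 1-2): among the
# configurations satisfying all hard constraints, return one maximizing the total
# weight of the satisfied soft constraints. Exponential in the number of variables.
from itertools import product

def solve(soft, hard, n_vars):
    """soft: list of (constraint predicate, weight function); hard: list of predicates."""
    best_x, best_f = None, float("-inf")
    for x in product([True, False], repeat=n_vars):
        if not all(h(x) for h in hard):
            continue
        f = sum(w(x) for s, w in soft if s(x))
        if f > best_f:
            best_x, best_f = x, f
    return best_x, best_f

# Toy theory: prefer weekends (weight 2) and sunshine (weight 1), subject to "not sick".
soft = [(lambda x: x[0], lambda x: 2.0), (lambda x: x[1], lambda x: 1.0)]
hard = [lambda x: not x[2]]
print(solve(soft, hard, n_vars=3))
```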

Although more general alternatives exist, such as semiring-based constraints [10], we will stick to our simple framework because: (1) it captures many prominent settings, from weighted maximum satisfiability (weighted MAX-SAT) up to optimization modulo theories [44], and (2) soft constraint theories in the above format can be easily learned with high-quality machine learning algorithms, as discussed next.

Learning Weighted MAX-SAT from Annotated Configurations

In weighted MAX-SAT, the soft constraints \(s_1, \ldots , s_m\) are arbitrary logic formulas and the per-constraint objective functions are constants \(w_j(\varvec{x}) \equiv w_j \in \mathbb {R}\). In the simplest case, no hard constraints are present. Inference boils down to finding a configuration \(\varvec{x}\) that maximizes the total weight of the satisfied formulas:

$$\begin{aligned} \max _{\varvec{x}} \; f_{\varvec{w}}(\varvec{x}) := \sum _{j=1}^m w_j \mathbbm {1}\left\{ \varvec{x}\,\models \,s_j\right\} \end{aligned}$$
(3)

This problem is notoriously NP-complete, but it can be solved efficiently in many practical cases [21].

Let us now consider perhaps the simplest possible learning scenario. We assume that there is a latent, unknown weighted MAX-SAT problem with parameters \(\varvec{w}^* \in \mathbb {R}^m\). We cannot observe this latent model, but we do know the dictionary of soft constraints \(s_1, \ldots , s_m\). Further, we are given a data set of example configurations \(\varvec{x}_1, \ldots , \varvec{x}_n\) annotated with their scores according to the unknown theory, i.e., \(y_i = \sum _j w^*_j \mathbbm {1}\left\{ \varvec{x}_i\,\models \,s_j\right\} \) for all \(i = 1, \ldots , n\). The configurations \(\varvec{x}_i\) are assumed to be sampled at random according to some underlying distribution, and no guarantee is given as to their quality.

The goal of learning is to induce a model \(\varvec{w}\) that behaves similarly to the latent one \(\varvec{w}^*\). For the time being, we will consider a model good so long as the estimated parameter vector \(\varvec{w}\) scores the examples similarly to the hidden model. Since we have access to the value of the true objective function \(y_i\) for all example configurations, finding an appropriate parameter vector can be cast as a regression problem, namely:

$$\begin{aligned} \textstyle \hat{\varvec{w}} \; = \mathop {\mathop {\text {argmin}}\limits }\nolimits _{\varvec{w}\in \mathbb {R}^m} \; \sum _{i=1}^n (y_i - f_{\varvec{w}}(\varvec{x}_i))^2 \end{aligned}$$
(4)

Now, notice that the function \(f_{\varvec{w}}(\varvec{x})\) is linear with respect to the basis defined by the indicator functions \(\{\mathbbm {1}\left\{ \varvec{x}\,\models \,s_j\right\} \,:\, j = 1, \ldots , m\}\). This means that Eq. 4 can be cast as linear regression and solved using standard regression techniques.
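Concretely, one can build the matrix of indicator features and solve Eq. 4 with ordinary least squares. A minimal sketch, assuming NumPy and constraints given as Python predicates (the toy data is ours):

```python
# Learning MAX-SAT weights by regression (cf. Eq. 4): map each example to its
# vector of constraint indicators, then fit the weights with least squares.
import numpy as np

def features(x, constraints):
    return np.array([1.0 if s(x) else 0.0 for s in constraints])

def learn_weights(examples, scores, constraints):
    Phi = np.stack([features(x, constraints) for x in examples])
    w, *_ = np.linalg.lstsq(Phi, np.asarray(scores), rcond=None)
    return w

# Toy usage: two soft constraints over three Boolean variables, true weights (2, 1).
constraints = [lambda x: x[0] or x[1], lambda x: not x[2]]
examples = [(1, 0, 0), (0, 0, 1), (1, 1, 1)]
scores   = [3.0, 0.0, 2.0]   # y_i = sum_j w*_j 1{x_i |= s_j}
w_hat = learn_weights(examples, scores, constraints)
```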

A very nice property of this setup is that linear regression works well even if the supervision \(y_i\) is moderately corrupted, and it provides a bridge to robust regression techniques for different kinds of noise.

One should keep in mind, however, that supervision on the per-instance scores \(y_i\) may not be readily available. This happens for instance when eliciting preferences from decision makers, who may be unable to state a numerical score. For this reason, we consider two alternative forms of supervision, namely pairwise rankings and input-output pairs, and show how to learn soft constraint theories from them.

Learning from Rankings

Let us start from pairwise rankings. In this case, we are given n pairs of configurations \(\{(\varvec{x}_i, \varvec{x}_i') : i = 1, \ldots , n\}\), where each pair is implicitly ranked according to the preference relation \(\varvec{x}_i \succeq \varvec{x}_i' \Leftrightarrow f_{\varvec{w}^*}(\varvec{x}_i) \ge f_{\varvec{w}^*}(\varvec{x}_i')\). The goal of learning is then to find a parameter vector \(\varvec{w}\) that ranks all of the example pairs correctly, or more formally:

$$\begin{aligned} \mathrm {find}&\; \varvec{w}\end{aligned}$$
(5)
$$\begin{aligned} \text {s.t.}&\; f_{\varvec{w}}(\varvec{x}_i) - f_{\varvec{w}}(\varvec{x}_i') \ge 0&\forall i = 1, \ldots , n \end{aligned}$$
(6)

The above constraint can be shown to be linear (in the indicators) by rewriting it as \(\sum _{j=1}^m w_j (\mathbbm {1}\left\{ \varvec{x}_i\,\models \,s_j\right\} - \mathbbm {1}\left\{ \varvec{x}_i'\,\models \,s_j\right\} ) \ge 0\). Notice that, even though the model is acquired from ranking data, we use it to compute high-scoring configurations, as per Eq. 3.

One issue with the above formulation is that simply conforming to the supervision does not guarantee that the learned vector \(\varvec{w}\) generalizes well to unseen pairs. This means that the learned function \(f_{\varvec{w}}\) may fail to rank the true optima above all other configurations. In order to address this, following principles from statistical learning theory [43, 55], it is customary to look for a vector \(\varvec{w}\) that correctly ranks all pairs by the largest possible margin, that is:

$$\begin{aligned} \max _{\varvec{w}: \Vert \varvec{w}\Vert _2 = 1, \; \mu \ge 0}&\; \mu \end{aligned}$$
(7)
$$\begin{aligned} \text {s.t.}&\; \mu \le f_{\varvec{w}}(\varvec{x}_i) - f_{\varvec{w}}(\varvec{x}_i')&\forall i = 1, \ldots , n \end{aligned}$$
(8)

where \(\mu \in \mathbb {R}\) measures the margin. It turns out that maximizing \(\mu \) under this normalization is geometrically equivalent to fixing the margin to 1 and minimizing the (squared) Euclidean norm of \(\varvec{w}\) ([43], Chap. 1), and so the above can be rewritten as the following convex quadratic optimization problem:

$$\begin{aligned} \min _{\varvec{w}}&\; \frac{1}{2} \Vert \varvec{w}\Vert _2^2 \end{aligned}$$
(9)
$$\begin{aligned} \text {s.t.}&\; f_{\varvec{w}}(\varvec{x}_i) - f_{\varvec{w}}(\varvec{x}_i') \ge 1&\forall i = 1, \ldots , n \end{aligned}$$
(10)

Finally, if the observations \(\varvec{x}_i\) are noisy or their rankings are inconsistent, as is the case when the examples define cycles like \(\varvec{x}_1 \succeq \varvec{x}_2 \succeq \ldots \succeq \varvec{x}_1\), then it may be impossible to find a non-zero vector \(\varvec{w}\) that simultaneously satisfies Eq. 10 for all examples. A common solution is to introduce slack variables \(\xi _i \ge 0\) that measure the “degree of violation” for every example \(i = 1, \ldots , n\):

$$\begin{aligned} \min _{\varvec{w}, \varvec{\xi }\ge 0}&\; \frac{1}{2} \Vert \varvec{w}\Vert _2^2 + \frac{\lambda }{2} \sum _{i=1}^n \xi _i \end{aligned}$$
(11)
$$\begin{aligned} \text {s.t.}&\; f_{\varvec{w}}(\varvec{x}_i) - f_{\varvec{w}}(\varvec{x}_i') \ge 1 - \xi _i&\forall i = 1, \ldots , n \end{aligned}$$
(12)

This is the well-known formulation of the ranking support vector machine (SVM), a now classical machine learning algorithm for learning to rank, originally conceived for ranking results in search engines [23]. The constant \(\lambda \ge 0\) controls the trade-off between generalization (first term) and error on the training set (second term), and it is assumed to be given.

Although earlier works focused on solving the above optimization problem (OP) in the dual (cf. [40]), current state-of-the-art approaches rely on gradient-based optimization in the primal [46]. Practitioners need not worry about these details: efficient SVM solvers are readily available in standard machine learning libraries, like scikit-learn [37] and Weka [19], and ranking SVMs can be implemented on top of them.
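As an illustration, a ranking SVM can be emulated by training a standard linear SVM on pairwise feature differences (a classic reduction). The sketch below uses scikit-learn and the same indicator features as above; all names and data are illustrative:

```python
# Ranking SVM via the pairwise-difference reduction: for each pair (preferred, other),
# add the difference of indicator feature vectors with label +1 and its negation with
# label -1, then train a linear SVM without intercept; its coefficients play the role of w.
import numpy as np
from sklearn.svm import LinearSVC

def features(x, constraints):
    return np.array([1.0 if s(x) else 0.0 for s in constraints])

def learn_to_rank(pairs, constraints, C=1.0):
    X, y = [], []
    for better, worse in pairs:
        d = features(better, constraints) - features(worse, constraints)
        X.append(d);  y.append(+1)
        X.append(-d); y.append(-1)
    svm = LinearSVC(C=C, fit_intercept=False)
    svm.fit(np.array(X), np.array(y))
    return svm.coef_.ravel()

constraints = [lambda x: x[0] or x[1], lambda x: not x[2]]
pairs = [((1, 0, 0), (0, 0, 1)), ((1, 1, 0), (0, 0, 0))]   # (preferred, other)
w_hat = learn_to_rank(pairs, constraints)
```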

Learning from Input-Output Pairs

Input-output pairs are another popular form of supervision. Let \(\varvec{U}\) and \(\varvec{V}\) partition the set of variables (i.e., \(\varvec{X}= \varvec{U}\cup \varvec{V}\), \(\varvec{U}\cap \varvec{V}= \varnothing \)). The intuition is that the variables \(\varvec{U}\) act as inputs and thus are always known and fixed, while \(\varvec{V}\) are outputs and can be optimized over.

The data set, in this case, consists of n input-output pairs \(\{(\varvec{u}_i, \varvec{v}_i) : i = 1, \ldots , n\}\), where for any partial assignment \(\varvec{u}_i\), the output \(\varvec{v}_i\) is chosen optimally w.r.t. the latent parameters \(\varvec{w}^*\), that is:

$$\begin{aligned} \varvec{v}_i = \mathop {\text {argmax}}\limits _{\varvec{v}} \; f_{\varvec{w}^*}(\varvec{u}_i \circ \varvec{v}) \end{aligned}$$
(13)

Here \(\circ \) indicates vector concatenation. This kind of supervision is common in machine learning tasks that require learning a map from structured inputs to structured outputs, like text parsing (where a sentence \(\varvec{u}\) is mapped to a parse tree \(\varvec{v}\)) or image segmentation (an image \(\varvec{u}\) is mapped to a set of labeled segments \(\varvec{v}\)), but it is also used in constraint learning, where the distinction between input and output variables is less well-defined [51].

In order to learn from input-output pairs, we adapt the training procedure of structured-output support vector machines (SSVM) [53]; see [24] for a gentler introduction. Given an example \((\varvec{u}_i, \varvec{v}_i)\) and an arbitrary vector \(\varvec{w}\), let \(\varvec{v}'\) be the output of Eq. 13 when using \(\varvec{u}_i\) as input and \(\varvec{w}\) as parameters (in place of \(\varvec{w}^*\)). Also, let \(\varDelta (\varvec{v}_i, \varvec{v}')\) be a distortion function that measures the difference between the correct output \(\varvec{v}_i\) and the predicted one \(\varvec{v}'\).

The intuition behind structured-output SVMs is that the vector \(\varvec{w}\) should be chosen so that, for any example, the predicted output has low distortion. This can be formalized as follows:

$$\begin{aligned} \min _{\varvec{w},\varvec{\xi }\ge 0}&\; \frac{1}{2} \Vert \varvec{w}\Vert _2^2 + \frac{\lambda }{2} \sum _{i=1}^n \xi _i \end{aligned}$$
(14)
$$\begin{aligned} \text {s.t.}&\; f_{\varvec{w}}(\varvec{u}_i \circ \varvec{v}_i) - f_{\varvec{w}}(\varvec{u}_i \circ \varvec{v}'') \ge \varDelta (\varvec{v}_i, \varvec{v}'') - \xi _i&\forall \varvec{v}'' \ne \varvec{v}_i \, \forall i = 1, \ldots , n \end{aligned}$$
(15)

The objective function is the same as for ranking data. The constraint, on the other hand, is much more complex: it requires the correct output \(\varvec{v}_i\) to be scored higher than any alternative output \(\varvec{v}'' \ne \varvec{v}_i\) by a margin proportional to the distortion. This takes care of enforcing an appropriate margin between \(\varvec{v}_i\) and the predicted output \(\varvec{v}'\) too. The per-example slacks \(\xi _i\) allow for scoring mistakes in noisy data sets.

The similarity to learning to rank is striking, but solving this optimization problem is trickier, because the number of alternative outputs \(\varvec{v}''\) can be very (exponentially) large. This means that Eq. 15 has to be enforced over an enormous number of configurations. The most straightforward approach to solving this OP is to employ cutting planes, which we will not discuss here; we refer the interested reader to [24] instead.
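For intuition, here is a rough primal subgradient sketch of the structured-output objective (an alternative to cutting planes), with brute-force loss-augmented inference over a tiny Boolean output space; everything outside Eqs. 14–15—feature map, Hamming distortion, hyperparameters—is an illustrative choice of ours:

```python
# Stochastic subgradient training for a structured-output SVM over Boolean outputs.
# Loss-augmented inference is done by brute force, so this only works for very small
# output spaces; it is a sketch, not a scalable solver.
import numpy as np
from itertools import product

def features(u, v, constraints):
    x = tuple(u) + tuple(v)
    return np.array([1.0 if s(x) else 0.0 for s in constraints])

def hamming(v1, v2):
    return sum(a != b for a, b in zip(v1, v2))

def train_ssvm(data, constraints, n_out, lam=0.1, lr=0.01, epochs=50):
    w = np.zeros(len(constraints))
    outputs = list(product([0, 1], repeat=n_out))
    for _ in range(epochs):
        for u, v in data:
            # Most violated output (margin-rescaled loss-augmented inference).
            v_hat = max(outputs, key=lambda v2:
                        hamming(v, v2) + w @ features(u, v2, constraints))
            # Subgradient of the regularized structured hinge loss.
            grad = lam * w - (features(u, v, constraints) - features(u, v_hat, constraints))
            w -= lr * grad
    return w
```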

Learning More General Constraint Theories

A striking property of the OPs described in the previous sections is that they are relatively agnostic to the particular choice of constraints and per-constraint objective functions. Indeed, regression-based learning has been used to learn arbitrary weighted CSPs [41], and SSVM-based learning to acquire optimization modulo theories problems [51]. In this last case, the restriction that the per-constraint objective functions \(w_j(\varvec{x})\) are constant is lifted—although the learning algorithm remains essentially unchanged.

5 Interactive Learning

Applications where supervision is scarce and expensive are not well suited for offline learning, because there are often too few examples to learn a reasonably accurate model. In this case, a sensible thing to do is to acquire supervision directly from an oracle by asking informative questions. This allows the learning algorithm to optimize the performance/example ratio, and thus to acquire good models at a small cost.

Notice that the oracle may be a human subject, like when eliciting constraints (knowledge) from a domain expert [41] or preferences from an end-user [39], or a full-fledged scientific apparatus, as in automated scientific experiments [25]. In the expert case, the goal is to extract a (soft or hard) constraint theory from a domain expert, who is otherwise unable to formalize and model her knowledge upfront. The end-user case captures applications like interactive recommender systems, where very little expertise (or patience!) can be expected of the human counterpart, and any request for supervision has to be designed so as to be easy to understand and answer.

The main questions that arise when designing an interactive learning algorithm are, of course, what kind of questions to ask the oracle and how to pick good ones. The answer to both is very application- and oracle-specific, as we will see in the following.

Interactive Learning of Hard Constraints

The most straightforward approach to learning hard constraints is to trivially select each candidate constraint \(h_j\) in turn, \(j = 1, \ldots , m\), and ask the oracle whether it appears in the latent constraint theory. Unfortunately, this naive approach is not very useful, as it requires considering all candidate constraints, which can be exponentially many in the number of variables. This procedure is therefore infeasible even for theories of modest size, especially if the oracle is a human being.

A more appropriate procedure is to use membership queries. In this setting, the learner chooses an instance \(\varvec{x}\) and asks the oracle whether it satisfies the hidden theory or not. This setup stands at the core of query-based learning [4], a venerable and sound approach to learning. Alternative query types will be considered later on.

Now, what is the best way to choose the query configuration \(\varvec{x}\)? Before proceeding, let us assume a realizable scenario, i.e., that (a) the set of hypotheses includes the latent concept, and (b) that the oracle always answers correctly. This is the case, for instance, if the learning problem is well engineered and the oracle is a human domain expert, e.g., an employee who has a vested interest in answering the questions asked by the algorithm. Under these (rather strong) assumptions, one can resort to halving approaches, which roughly work as follows.

Learning proceeds interactively. At each iteration, the learner keeps track of the set of candidate theories that are consistent with all answers observed so far—i.e., the version space. The idea is to choose, at each iteration, a configuration \(\varvec{x}\) such that, regardless of the answer to the membership query, the version space shrinks as much as possible. In the best possible scenario, the version space halves at each iteration, and therefore ideally the number of queries necessary to find the correct concept is approximately \(\log _2 |\mathcal {H}|\), where \(\mathcal {H}\) is the set of candidate hypotheses.
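For a finite, explicitly enumerated hypothesis space, the idea can be sketched as follows; hypotheses are Boolean predicates and the oracle is a callback, both of which are representation choices of ours:

```python
# Membership-query learning over an explicit version space: repeatedly ask about the
# configuration on which the surviving hypotheses are most evenly split, then discard
# the hypotheses that disagree with the oracle's answer.
from itertools import product

def halving(hypotheses, n_vars, oracle, budget=20):
    configs = list(product([True, False], repeat=n_vars))
    vs = list(hypotheses)                      # current version space
    for _ in range(budget):
        if len(vs) <= 1:
            break
        # Query whose answer prunes the version space most evenly (ideally in half).
        x = min(configs, key=lambda c: abs(sum(h(c) for h in vs) - len(vs) / 2))
        label = oracle(x)                      # True iff x is feasible for the hidden theory
        vs = [h for h in vs if h(x) == label]
    return vs
```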

Unfortunately, this theoretically appealing approach has several flaws. First, keeping track of the version space can be quite complicated. The best approaches to date make use of rather convoluted schemas [8] or only store the most general and most specific candidate theories in the version space [9]. Second, it turns out that in many interesting cases, choosing the (approximately) optimal instance \(\varvec{x}\) is computationally intractable [31]. A simple approximation to this schema is to choose an instance \(\varvec{x}\) that reduces the version space by some amount. This is accomplished in [9] by choosing an instance whose feasibility the most general and the most specific theories in the current version space disagree on. Therefore, if it turns out that \(\varvec{x}\) is feasible, the most specific hypothesis is wrong and it must be generalized. On the other hand, if \(\varvec{x}\) is actually unfeasbile, the converse is true and the most general hypothesis must be specialized. Thus, this strategy guarantees that the version space reduces at each iteration, and that it eventually contains only concepts that match all examples.

Interactive Learning of Soft Constraints

Consider the following toy application: A user wishes to buy a custom PC. The PC is assembled from individual components: CPU, HDD, RAM, etc. Valid PC configurations must satisfy constraints, e.g. CPUs only work with compatible motherboards [50]. In this setting, one is tasked with constructing a PC configuration \(\varvec{x}\) that is both palatable to the customer and compatible with the hard constraints.

As above, we cast this kind of problem as learning a weighted CSP that captures the preferences of the customer, and that can be used to generate high-scoring (i.e., preferred and feasible) configurations to recommend.

In the following, we will consider queries of three common kinds:

  • Scoring queries. In this case, the oracle is asked to provide the true numerical score of configuration \(\varvec{x}\) chosen by the learner. Queries of this type make sense when interacting with very precise oracles, such as measurement devices in automated scientific experiments [25], but not as much when interacting with human oracles. Indeed, it is very difficult—even for domain experts—to provide precise or approximate numerical scores. This is why most scoring systems use discrete ratings, such as star ratings. Regardless, depending on the application, other query types may be easier to answer.

  • Ranking queries. In this case, the query consists of two unlabeled configurations \(\varvec{x}\) and \(\varvec{x}'\), and the oracle is tasked with indicating the most preferable of the two, i.e., whether \(f_{\varvec{w}^*}(\varvec{x}) \ge f_{\varvec{w}^*}(\varvec{x}')\) holds. These queries have found ample application in interactive preference elicitation and recommendation tools [39].

  • Improvement queries. In this case, the algorithm chooses a single configuration \(\varvec{x}\) and asks the oracle to provide an improved version \(\bar{\varvec{x}}\). Notice that the improvement is allowed to be small or partial [47]. It is easy to see that the pair \((\bar{\varvec{x}}, \varvec{x})\) implicitly defines a pairwise ranking of the form \(\bar{\varvec{x}} \succeq \varvec{x}\), and so the supervision is the same as for ranking queries. It is important to notice, however, that the interaction itself is different. Indeed, improvement queries only require the human oracle to observe and analyze a single configuration (rather than two), and synergize very well with direct manipulation interfaces like those used in computer-assisted design.

Notice that scoring queries lead to collecting a data set of scoring information, and thus learning can be cast in terms of linear regression. In the other two cases, the collected data set contains pairwise ranking information, and therefore learning boils down to learning to rank. Thus the techniques discussed in the previous section can be applied directly to the interactive case: it is just a matter of solving a regression, ranking, or structured-output learning problem every time a new answer is obtained. Although not entirely principled, this approach tends to work well in practice.

The question is, again, how to pick the right query. Ideally, the most informative question should be chosen. Unfortunately, evaluating the true informativeness of a query based on the information available during learning is tricky. A very simple solution is to choose a configuration \(\varvec{x}\) (or a pair of configurations) that is maximally uncertain according to the model. More specifically, an instance is uncertain if the entropy of the response variable (e.g. of the score, for scoring queries) is large. Measures based on the margin are also common [45]. The major downside of uncertainty sampling is that the uncertainty provided by the model may be misleading, for instance if the learned theory is “overconfident”. It is therefore customary to combine uncertainty sampling with other strategies that lessen its reliance on the model’s estimates [45].
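A minimal sketch of margin-based selection for ranking queries: among a pool of candidate configurations, pick the pair whose scores under the current model are closest, i.e. the pair the model is least sure how to order (the candidate pool, feature map, and weights below are purely illustrative):

```python
# Margin-based query selection: choose the pair of candidates with the smallest
# absolute score difference under the current weight estimate.
import numpy as np
from itertools import combinations

def select_ranking_query(candidates, w, constraints):
    def score(x):
        return sum(wj for wj, s in zip(w, constraints) if s(x))
    return min(combinations(candidates, 2),
               key=lambda pair: abs(score(pair[0]) - score(pair[1])))

# Toy usage: ask the oracle to rank the two most "tied" configurations, then retrain
# on the enlarged ranking data set (cf. Sect. 4) and repeat.
constraints = [lambda x: x[0] or x[1], lambda x: not x[2]]
w = np.array([2.0, 1.0])
candidates = [(1, 0, 0), (0, 1, 1), (0, 0, 0), (1, 1, 1)]
x, x_prime = select_ranking_query(candidates, w, constraints)
```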

6 Further Reading

Learning and Optimization. The interplay between machine learning and satisfaction/optimization is not limited to constraint learning. In most of machine learning (excluding a large chunk of its Bayesian side), learning is framed as the task of optimizing some regularized loss function with respect to the data set [55]—hence the centrality of optimization in ML. Nowadays, the most popular solution approaches are gradient descent techniques [29], but one should not forget that there is a plethora of valid alternatives, cf. [11, 49]. Some learning frameworks go one step further, and make use of declarative constraint programming to model and solve learning tasks like program synthesis and pattern mining [2, 18]. Another link between ML and optimization occurs in probabilistic graphical models, structured-output prediction, and constraint learning, where computing a prediction for an unseen instance is itself an optimization problem. There is also abundant literature on speed-up learning, i.e., on leveraging machine learning techniques for improving the run-time of satisfaction and optimization solvers; see for instance [22] for a method to accelerate branch & bound in mixed-integer linear programming solvers. It is easy to spot a recursion here: machine learning could in principle be used to accelerate an optimization step which is itself part of a learning problem. While true, diminishing returns make it difficult to exploit this loop.

Learning Hard Constraints. The task of acquiring hard constraints is surprisingly close to concept learning [54] and binary classification: in all of these, the learner has to acquire a hidden concept from examples. The main difference is whether the model in question is a constraint theory, and whether it is used in a purely predictive manner or also for inspection, interpretation, debugging, etc. A major advantage of constraints is that they can be verified by domain experts either manually or with appropriate tools—and modified, if necessary. Verification of models learned from data is often required to guarantee the proper functioning of the model [3], and indeed verification techniques are being actively researched for non-constraint-based models like neural networks [13].

From a search perspective, learning hard constraints is also closely related to feature selection, pattern mining, and structure learning of probabilistic graphical models [28]—e.g. Bayesian networks and Markov networks. All these tasks revolve around efficiently encoding and searching a very large set of candidates, and often employ similar techniques, e.g., version spaces, variants of grafting [38], and syntax-guided synthesis [6], albeit often under different names. Inductive logic programming, where the task is to learn first-order theories [34], can be handled analogously, but it involves even larger search spaces.

Learning Soft Constraints. As for soft constraints, we focused on learning techniques rooted in statistical learning theory and empirical risk minimization [55] and max-margin methods like support vector machines [43]. Learning from input-output examples stands at the core of structured-output prediction [24] and learning to search [14]. One link that is seldom made is to inverse (combinatorial) optimization, where the goal is to adjust a pre-existing (combinatorial) optimization model such that it adheres to a set of known optimal solutions. Unsurprisingly, this field has slowly been drifting closer to the structured-output setting, and these two problems have recently been tackled using similar strategies, see for instance [47] and [16].