1 Introduction

The traditional approach to data mining is to develop specialized algorithms for specific tasks. This has led to a wide range of such algorithms, for tasks including classification, clustering and association rule discovery [19, 30, 35]. In many cases these algorithms support certain kinds of constraints as well; in particular, constraint-based clustering and constraint-based pattern mining are established research areas [2, 5, 6]. Even though successful for specific applications, the downside of specialized algorithms is that they are hard to adapt to novel tasks.

In recent years, researchers have explored the idea of using generic solvers to tackle data mining problems such as itemset mining [16, 22], sequence mining [7, 31] and clustering [11, 13]. These approaches start from the insight that many data mining problems can be formalized as either a constraint satisfaction problem, or a constrained optimization problem. The advantage is that, just as in constraint programming, new tasks can be addressed by changing the constraint specification.

Although these techniques allow flexibility in modeling new tasks, they are often tied to one particular solver. To address this shortcoming we introduced MiningZinc [15], a solver-independent language and framework for modeling constraint-based data mining tasks. That work focused on constraint-based itemset mining problems and the solving capabilities of the framework. In this work, we focus on the modeling aspects of the MiningZinc language and show how a wide range of data mining tasks can be modeled in MiningZinc, namely itemset mining, sequence mining, Bayesian pattern mining, linear regression, clustering, relational data factorization and ranked tiling. We end with a discussion of related work and a conclusion.

2 Language

Ideally, a language for mining allows one to express problems in a natural way, while at the same time being generic enough to express additional constraints or different optimization criteria. We choose to build on the MiniZinc constraint programming language for this purpose [32]. It is a subset of Zinc [25] restricted to the built-in types bool, int, set and float, extended with user-defined predicates and functions [34].

MiningZinc uses the MiniZinc language, that is, all models written for MiningZinc are compatible with the standard MiniZinc compiler and toolchain. However, MiningZinc offers additional functionality aimed towards modeling data mining problems:

  • a library of predicates and functions that are commonly used in data mining

  • extensions for reading data from different sources (e.g. a database)

  • integration with Python for easy implementation of greedy algorithms

2.1 MiniZinc

MiniZinc is a language designed for specifying constraint problems. Listing 1 shows an example of a MiniZinc model for the well-known “Send+More=Money” problem. In this problem the goal is to assign distinct digits to the letters such that the formula holds.

This model starts with declaring the decision variables with their domains (Lines 1 and 2). The problem specification states that the variables S and M should not be zero, which we encode in their domains. Next, we specify the constraints on these decision variables. On Line 4 we specify that all variables should take a different value. For this we use the all_different global constraint which is defined in MiniZinc’s standard library. In order to use this constraint we include that library (Line 3). On Line 5 we specify that the numbers formed by the digits “SEND” and “MORE” should sum up to the number formed by the digits “MONEY”. For the translation between the list of digits and the number they represent, we define a helper function on Line 6; it first creates a local parameter max_i that represents the largest index, and then sums over each variable multiplied by 1, 10, 100, ... depending on its position in the array (for example, number([S,E,N,D]) = 1000*S + 100*E + 10*N + 1*D). Line 7 states that this is a constraint satisfaction problem. MiniZinc also supports optimization problems, in which case this statement would be replaced by solve minimize <variable expression>, or likewise with maximize. Finally, Line 8 defines the output of the model.

Listing 1. A MiniZinc model for the Send+More=Money problem.
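A MiniZinc model along these lines could look as follows (a sketch following the line-by-line description above; the original listing may differ in details):

    var 1..9: S;  var 0..9: E;  var 0..9: N;  var 0..9: D;    % Line 1
    var 1..9: M;  var 0..9: O;  var 0..9: R;  var 0..9: Y;    % Line 2
    include "globals.mzn";                                    % Line 3
    constraint all_different([S,E,N,D,M,O,R,Y]);              % Line 4
    constraint number([S,E,N,D]) + number([M,O,R,E])
             = number([M,O,N,E,Y]);                           % Line 5
    function var int: number(array[int] of var int: ds) =     % Line 6
      let { int: max_i = max(index_set(ds)) } in
      sum (i in index_set(ds)) (pow(10, max_i - i) * ds[i]);
    solve satisfy;                                            % Line 7
    output [show([S,E,N,D,M,O,R,Y])];                         % Line 8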
Fig. 1. Overview of the MiniZinc toolchain.

Apart from the functionality demonstrated in the example, MiniZinc models can be parameterized, for example, to include external data. The values of these parameters can be loaded from an external file, or passed in through the command line.

MiniZinc is solver-independent, that is, MiniZinc models can be translated to the lower level language FlatZinc which is understood by a wide range of solvers. The MiniZinc toolchain does however support solver-specific optimizations through the use of solver-specific versions of the standard library and annotations. A schematic overview of the MiniZinc toolchain is shown in Fig. 1.

In the following we describe how we extended MiniZinc.

2.2 Library

MiniZinc offers the ability to add additional libraries of commonly used predicates and functions. As part of MiningZinc, we created a minimal library to help with specifying mining and learning problems. It has two purposes: (1) to simplify modeling for the user and (2) to simplify the model analysis by the MiningZinc solver.

There are four categories of functions:

  • generic helper functions

  • itemset mining functions

  • norms and distance functions

  • extra solver annotations

There are two generic helper functions that we feel are missing in MiniZinc, namely a direct bool2float conversion function var float: b2f(var bool: B) and an explicit weighted sum:

function var int: weighted_sum(array[int] of var int: W, array[int] of var int: X).

Itemset mining is the best studied problem category in MiningZinc, and the key abstraction is the cover function: cover(Items, TDB). Other helper functions are coverInv(Trans, TDB) and frequent_items(TDB, MinFreq).

Many data mining and machine learning problems involve computing distances. For this reason, we added to the library functions that compute the \(l_1\), \(l_2\) and \(l_\infty \) norms, the Manhattan, Euclidean and Chebyshev distance as well as the sum of squared errors and mean squared error:

Listings 2 and 3. Norm, distance and error functions of the MiningZinc library.
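As an illustration, definitions in the style of the library could look as follows; apart from sumSqErr, which is used later in this chapter, the function names are assumptions:

    function var float: manhattan(array[int] of var float: A,
                                  array[int] of var float: B) =
      sum (i in index_set(A)) (abs(A[i] - B[i]));             % l1 distance

    function var float: euclidean(array[int] of var float: A,
                                  array[int] of var float: B) =
      sqrt(sum (i in index_set(A)) ((A[i]-B[i])*(A[i]-B[i])));  % l2 distance

    function var float: chebyshev(array[int] of var float: A,
                                  array[int] of var float: B) =
      max (i in index_set(A)) (abs(A[i] - B[i]));             % l_inf distance

    function var float: sumSqErr(array[int] of var float: Yhat,
                                 array[int] of var float: Y) =
      sum (i in index_set(Y)) ((Yhat[i]-Y[i])*(Yhat[i]-Y[i]));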

Finally, the library also declares a few annotations that provide additional information to the solver such as load_data and vartype, which are discussed in the next section.

The MiningZinc library can be used by adding the following statement.

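This is a regular MiniZinc include item, along the following lines (the exact filename is an assumption):

    include "mining.mzn";  % load the MiningZinc library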

The library is written in MiniZinc and is fully compatible with the standard MiniZinc toolchain.

2.3 Facilities for Loading Data

The MiningZinc library also declares a few annotations that allow us to extend the functionality of MiniZinc with respect to loading data from external sources. This consists of two components: (1) accessing data from an external data source and (2) translating it to a data structure supported by MiniZinc.

When using standard MiniZinc, if one wants to use external data, the workflow would be as follows:

  1. Determine the relevant information from the database

  2. Use SQL to extract this information

  3. Translate the data to numeric identifiers using a script

  4. Write out the data into MiniZinc format (dzn) using a script

  5. Execute MiniZinc

  6. Translate the results’ identifiers back to the original data using a script

  7. Analyze the results and, if necessary, repeat the process

Using the data loading facilities available in MiningZinc, the workflow becomes:

  1. Determine the relevant information from the database

  2. Update the MiningZinc model with data sources pointing to the relevant information

  3. Execute MiningZinc

  4. Analyze the results and, if necessary, repeat the process

Reading Data. MiningZinc facilitates loading data from external sources through the load_data(specification) annotation. The specification is a string describing the data source. By default, MiningZinc supports the following specifications:

  • sqlite;<filename>;<SQL query> Retrieve data from an SQLite database based on an SQL query.

  • arff;<filename> Retrieve data from a file in Weka’s ARFF format [18].

  • csv;<filename>;<field separator> Retrieve data from a CSV file.

The use of these annotations is illustrated in Listing 4.

Listing 4. Loading data from external sources with the load_data annotation.
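Usage could look as follows (a sketch; the file names and the SQL query are placeholder assumptions):

    % from an SQLite database, via an SQL query
    array[int] of set of int: TDB1
      ::load_data("sqlite;shop.db;SELECT tid, item FROM purchases");
    % from a file in Weka's ARFF format
    array[int] of set of int: TDB2 ::load_data("arff;zoo.arff");
    % from a comma-separated CSV file
    array[int] of set of int: TDB3 ::load_data("csv;data.csv;,");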

The translation process is determined based on the structure of the input data and the target type in MiniZinc. For example, given an input table with two columns and the output type array[int] of set of int, the first column is interpreted as the index of the array and the second column as an element of the set. This is illustrated in Fig. 2.

Fig. 2. Default translation to an array[int] of set of int from a table with two columns.

Automatic Translations. The previous example shows that during the loading of the data, we need to translate some of the data to an integer range. MiningZinc performs these translations automatically. The user can guide this translation process by adding type annotations to variable definitions. This can be done using the vartype annotation as illustrated in Listing 5. The additional information allows MiningZinc to translate the solutions back to the original values.

Listing 5. Guiding the translation of data with the vartype annotation.
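A sketch of such an annotated declaration (the argument convention of vartype is an assumption):

    % declare that the integers in TDB encode items, so that solutions
    % over Items can be translated back to the original item names
    array[int] of set of int: TDB
      ::load_data("csv;transactions.csv;,") ::vartype("item");
    var set of 1..NrI: Items ::vartype("item");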

2.4 Python Integration

MiniZinc is a language that is specifically designed for expressing constrained optimization problems. It is therefore not very suitable for writing complete systems; it should instead be seen as a domain-specific language. To facilitate the use of MiningZinc we provide an interface with Python through the mngzn module. This interface allows the user to parse a MiningZinc model from a Python string, provide parameters using Python’s dictionaries and query its results as a Python data structure. The main interface is demonstrated in Listing 6.

Listing 6. Using a MiningZinc model from Python.
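A sketch of the listing, matching the line numbering used in the description below (the mngzn function names parseModel and solve are assumptions based on that description):

    import mngzn                                                # Line 1
    modelstr = "int: sum; int: max; var 1..max: a; var 1..max: b; constraint a + b = sum; solve satisfy; output [show(a), show(b)];"  # Line 2
    params = {'sum': 3, 'max': 2}                               # Line 3
    model = mngzn.parseModel(modelstr, params)                  # Line 4
    solutions = model.solve()                                   # Line 5
    for sol in solutions:                                       # Line 6
        print("a=%d b=%d" % (sol['a'], sol['b']))               # Line 7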

First, we load the MiningZinc package (Line 1) and we define a model as a string of MiniZinc code (Line 2). The model takes two parameters sum and max and finds all pairs of integers up to max that sum to sum. On Line 3 we set the values of these parameters in a Python dictionary. Next, we parse the model string together with the parameters to obtain a solvable model (Line 4). On Line 5 we solve the model and obtain the solutions. This returns a sequence of Python dictionaries containing the output variables of the model, in this case [{'a': 1, 'b': 2}, {'a': 2, 'b': 1}]. Finally, we format and print out each solution (Line 7).

In Sect. 3.7 (Listing 17) we show an example of how this interface can be used to implement a greedy algorithm.

3 Modeling Data Mining Problems

We show how to model a variety of data mining problems, including constraint-based itemset mining, sequence mining, clustering, linear regression, ranked tiling and more.

In each case, the problem is modeled as a standard constraint satisfaction or optimization problem, using the primitives available in MiniZinc as well as some common functions and predicates that we have added to the MiningZinc library.

3.1 Itemset Mining

The MiningZinc work originates from our work on using constraint programming (solvers) for constraint-based itemset mining [16].

Problem Statement. Itemset mining was introduced by Agrawal et al. [1] and can be defined as follows. The input consists of a set of transactions, each of which contains a set of items. Transactions are identified by identifiers \(\mathcal{S}= \{1, \ldots , n\}\); the set of all items is \(\mathcal{I} = \{1, \ldots , m\}\). An itemset database \(\mathcal{D}\) maps transaction identifiers to sets of items: \(\mathcal{D}(t)\subseteq \mathcal{I}\). The frequent itemset mining problem is then defined mathematically as follows.

Definition 1

(Frequent Itemset Mining). Given an itemset database \(\mathcal{D}\) and a threshold \(\alpha \), the frequent itemset mining problem consists of finding all itemsets \(I \subseteq \mathcal{I}\) such that \(| \phi _\mathcal{D}(I) | \ge \alpha \), where \(\phi _\mathcal{D}(I) = \{ t \,|\, I\subseteq \mathcal{D}(t)\}\).

The set \(\phi _\mathcal{D}(I)\) is called the cover of the itemset, and the threshold \(\alpha \) the minimum frequency threshold. An itemset I with \(| \phi _\mathcal{D}(I) | \ge \alpha \) is called a frequent itemset.

Listing 7. Frequent itemset mining in MiningZinc.
Listing 8. The cover function from the MiningZinc library.
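A sketch of both listings, following the description below; the comments mark the lines the text refers to, and the library filename is an assumption:

    % Listing 7: frequent itemset mining
    include "mining.mzn";
    int: NrI; int: NrT; int: Freq;                  % line 2: parameters
    array [1..NrT] of set of 1..NrI: TDB;           % line 3: the data

    var set of 1..NrI: Items;                       % line 5: the itemset

    constraint card(cover(Items, TDB)) >= Freq;     % line 7: min. frequency
    solve satisfy;                                  % line 8

    % Listing 8: the cover function, written in MiniZinc itself
    function var set of int: cover(var set of int: Items,
                                   array[int] of set of int: D) =
      let {
        var set of index_set(D): Trans;
        constraint forall (t in index_set(D))
                     (t in Trans <-> Items subset D[t]);
      } in Trans;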

MiningZinc Model. Listing 7 shows the frequent itemset mining problem in MiningZinc. Lines 2 and 3 are parameters and data, which a user can provide separately from the actual model or through load_data statements. The model represents the items and transaction identifiers in \(\mathcal{I}\) and \(\mathcal{S}\) by natural numbers (from 1 to NrI and 1 to NrT respectively) and the dataset \(\mathcal{D}\) by the array TDB, mapping each transaction identifier to the corresponding set of items. The set of items we are looking for is represented on line 5 as a set variable with elements between value 1 and NrI. The minimum frequency constraint is posted on line 7; it naturally corresponds to the formal notation \(| \phi _\mathcal{D}(I) | \ge \alpha \). The cover relation used on line 7 and shown in Listing 8 is part of the MiningZinc library and implements \(\phi _\mathcal{D}(I) = \{ t | I\subseteq \mathcal{D}(t)\}\); note that this constraint is not hard-coded in the solver, as it is in other systems, but is implemented in the MiningZinc language itself, and can hence be changed if desired. Finally, line 8 states that it is a satisfaction problem. Enumerating all solutions that satisfy the constraints corresponds to enumerating all frequent itemsets.

This example demonstrates the appeal of using a modeling language like MiniZinc for pattern mining: The formulation is high-level, declarative and close to the mathematical notation of the problem. Furthermore, the use of user-defined functions allows us to abstract away concepts that are common when expressing constraint-based mining problems.

Constraint-Based Mining. In constraint-based mining the idea is to incorporate additional user-constraints into the mining process. Such constraints can be motivated by the overwhelming number of (redundant) results that would otherwise be found, or by application-specific needs such as searching for patterns with high profit margins in sales data.

Listing 9 shows an example constraint-based mining setting. Compared to Listing 7, two constraints have been added: a closure constraint on line 6, which avoids non-closed patterns in the output, and a minimum cost constraint on line 10, requiring that the sum of the costs of the individual items is above a threshold. Other constraints could be added and combined in a similar way. See [16] for the range of constraints that has been studied in a constraint programming setting.

Listing 9. Constraint-based itemset mining with closure and minimum cost constraints.
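A sketch of such a model (parameter names are assumptions); note how the closure constraint can be stated directly with the library functions cover and coverInv:

    include "mining.mzn";
    int: NrI; int: NrT; int: Freq; int: MinCost;
    array [1..NrT] of set of 1..NrI: TDB;
    array [1..NrI] of int: Cost;

    var set of 1..NrI: Items;

    constraint Items = coverInv(cover(Items, TDB), TDB);   % closure (line 6)
    constraint card(cover(Items, TDB)) >= Freq;            % min. frequency
    constraint sum (i in 1..NrI)                           % min. cost (line 10)
                 (Cost[i] * bool2int(i in Items)) >= MinCost;
    solve satisfy;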

3.2 Sequence Mining

Sequence mining [1] can be seen as a variation of the itemset mining problem discussed above. Whereas in itemset mining each transaction is a set of items, in sequence mining both transactions and patterns are ordered (i.e., they are sequences instead of sets), and symbols can be repeated. For example, \({\mathrm {\langle {b,a,c,b} \rangle }} \) and \({\mathrm {\langle {a,c,c,b,b} \rangle }} \) are two sequences, and the sequence \({\mathrm {\langle {a,b} \rangle }} \) is one possible pattern included in both.

Problem Statement. Two key concepts in any pattern mining setting are the structure of the pattern, and the cover relation that defines when a pattern covers a transaction.

In sequence mining, a transaction is covered by a pattern if there exists an embedding of the sequence pattern in the transaction, where an embedding is a mapping of every symbol in the pattern to the same symbol in the transaction such that the order is respected.

Definition 2

(Embedding in a sequence). Let \(S = \langle s_1,\ldots , s_m \rangle \) and \(S' = \langle s'_1, \ldots , s'_n\rangle \) be two sequences of size m and n respectively with \(m\le n\). The tuple of integers \(e = (e_1, \ldots , e_m)\) is an embedding of S in \(S'\) (denoted \(S \sqsubseteq _e S'\)) if and only if:

$$\begin{aligned} S \sqsubseteq _e S' \leftrightarrow e_1< \ldots < e_m ~\text {and}~ \forall i \in \{1,\ldots ,m\}: s_i = s'_{e_i} \end{aligned}$$
(1)

For example, let \(S={\mathrm {\langle {a,b} \rangle }} \) be a pattern, then (2, 4) is an embedding of S in \({\mathrm {\langle {b,a,c,b} \rangle }} \) and (1, 4), (1, 5) are both embeddings of S in \({\mathrm {\langle {a,c,c,b,b} \rangle }} \).

Given an alphabet \(\Sigma \) of symbols, a sequential database D is a set of transactions where each transaction is a sequence defined over symbols in \(\Sigma \). As in itemset mining, let \(\mathcal{D}\) be a mapping from transaction identifiers to transactions. The frequent sequence mining problem is then defined as follows:

Definition 3

(Frequent Sequence Mining). Given a sequential database \(\mathcal{D}\) with alphabet \(\Sigma \) and a threshold \(\alpha \), the frequent sequence mining problem consists of finding all sequences S over alphabet \(\Sigma \) such that \(| \psi _\mathcal{D}(S) | > \alpha \), where \(\psi _\mathcal{D}(S) = \{ t \,|\, \exists e ~\text {s.t.}~ S \sqsubseteq _e \mathcal{D}(t)\}\).

The set \(\psi _\mathcal{D}(S)\) is the cover of the sequence, similar to the cover of an itemset.

MiningZinc Model. Modeling this in MiningZinc is somewhat more complex than itemset mining: for itemsets we could reuse the set variable type, while sequences and the embedding relation need to be encoded. Each symbol is given an identifier (starting at 1), and a transaction is represented as an array of symbol identifiers. The data is hence represented by a two-dimensional array, and all sequences are padded with the identifier 0 such that they have the same length (MiniZinc does not support an array of arrays of different lengths). This data is given in lines (2)–(7) in Listing 10.

The pattern itself is also an array of integers, representing symbol identifiers. The 0 identifier can be used as padding at the end of the pattern, so that patterns of any size can be represented. Line (9) declares the array of integer variables, while lines (11)–(13) enforce the padding meaning of the 0 identifier.

To encode the cover relation we cannot quantify over all possible embeddings e explicitly, as there can be an exponential number of them. Instead, we add one array of variables for every transaction that will represent the embedding of the pattern in that transaction, if one exists (line 15). Furthermore, we add one Boolean variable for every transaction, which will indicate whether the embedding is valid, i.e. whether the transaction is covered (line 17). Using these variables, we can encode the cover relation (line 19), explained below, as well as the constraint that the number of covered transactions must be larger than the minimum frequency threshold (line 21).

Listing 10. Frequent sequence mining.
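A sketch of the model, with comments marking the lines the text refers to; the parameter names and the cover predicate, sketched after Listing 11 below, are assumptions:

    int: NrS; int: NrT; int: MaxLen; int: PatLen; int: MinFreq;
    array [1..NrT, 1..MaxLen] of int: Data;           % lines 2-7: 0-padded data

    array [1..PatLen] of var 0..NrS: Pat;             % line 9: the pattern

    constraint Pat[1] != 0;                           % lines 11-13: 0s form
    constraint forall (j in 1..PatLen-1)              % a suffix of the pattern
                 (Pat[j] = 0 -> Pat[j+1] = 0);

    array [1..NrT, 1..PatLen] of var 0..MaxLen: Emb;  % line 15: embeddings
    array [1..NrT] of var bool: Covered;              % line 17: coverage flags

    constraint forall (t in 1..NrT)                   % line 19: cover relation
                 (sequence_cover(t, Pat, Data, Emb, Covered));

    constraint sum (t in 1..NrT)                      % line 21: min. frequency
                 (bool2int(Covered[t])) >= MinFreq;
    solve satisfy;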

The actual formulation of the cover relation is shown in Listing 11; it could be made available as a function returning a set variable too. The formulation consists of three parts. In the first part (line 6) we constrain that, for each transaction, the jth embedding variable must point to a position in the transaction that matches the jth symbol in the pattern. Note that if no match is possible then the embedding variable will only have the value 0 in its domain. The second part (line 9) requires that the embedding variables must be increasing (except when 0). Finally, on line 12 we state that a transaction is covered if for every non-0 valued position in the pattern there is a matching embedding variable.

Listing 11. The sequence cover relation.
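A sketch of the three parts (the predicate name and signature are assumptions, matching the sketch above):

    predicate sequence_cover(int: t,
                             array[int] of var int: Pat,
                             array[int,int] of int: Data,
                             array[int,int] of var int: Emb,
                             array[int] of var bool: Covered) =
      % part 1 (line 6): a used embedding variable points at a position
      % in the transaction holding the matching pattern symbol
      forall (j in index_set(Pat))
        (Emb[t,j] > 0 -> Data[t, Emb[t,j]] = Pat[j])
      /\ % part 2 (line 9): embedding positions increase; 0 = unused
      forall (j in 1..length(Pat)-1)
        ((Emb[t,j] > 0 /\ Emb[t,j+1] > 0) -> Emb[t,j] < Emb[t,j+1])
      /\ % part 3 (line 12): covered iff every non-0 pattern position
      % has a matching embedding variable
      (Covered[t] <-> forall (j in index_set(Pat))
                        (Pat[j] > 0 -> Emb[t,j] > 0));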

As in itemset mining, extra constraints can be added to extract fewer, but more relevant or interesting patterns. An overview of constraints that have been studied in a sequence mining setting is available in [31].

3.3 Constraint-Based Pattern Mining in Bayesian Networks

Just as one can mine patterns in data, it is also possible to mine patterns in Bayesian Networks (BNs). These patterns can help in understanding the knowledge that resides in the network. Extracting such knowledge can be useful when trying to better understand a network, for example when one is presented with a new network, when the network is large and complex, or when it is updated frequently.

Problem Statement. A Bayesian Network \(\mathcal {G}\) defines a probability distribution over a set of random variables \(\mathcal {X}\). Each variable has a set of values it can take, called its domain. We define a Bayesian network pattern as an assignment to a subset of the variables:

Definition 4

(BN pattern). A pattern A over a Bayesian network \(\mathcal {G}\) is a partial assignment, that is, an assignment to a subset of the variables in \(\mathcal {G}\): \(A = \{ (X_1 =x_1),\ldots ,(X_m = x_m) \}\), where each \(X_i\) is a different variable and \(x_i\) is a possible value in its domain.

A BN pattern can be seen as an itemset, where each item is an assignment of a variable to a value. Just as one can compute the (relative) frequency of an itemset in a database, for a BN pattern one can compute the probability of the pattern in the Bayesian network. The probability of a pattern A, denoted by \(P_{\mathcal {G}}(A)\), is \(P( (X_1 = x_1),\ldots ,(X_m = x_m) )\), that is, the probability of the partial assignment marginalized over the unassigned variables.

Given this problem setting, one can define a range of constraint-based pattern mining tasks over Bayesian Networks, similar to constraint-based mining tasks over itemsets or sequences. In line with frequent itemset mining, the following defines the probable BN pattern mining problem:

Definition 5

(Probable BN pattern Mining). Given a Bayesian network \(\mathcal {G}\) over variables \(\mathcal {X}\) and a threshold \(\alpha \), the probable BN pattern mining problem consists of finding all BN patterns A over \(\mathcal {X}\) such that \(P_{\mathcal {G}}(A) > \alpha \).

MiningZinc Model. We encode a BN pattern with an array of integer CP variables, as many as there are random variables in the BN. Each CP variable has m+1 possible values, where m is the number of values in the domain of the corresponding random variable: value 0 represents that the variable is not part of the partial assignment, i.e. it should be marginalized over when computing the probability. The other values each correspond to a value in the domain of the random variable.

The main issue is then to encode the probability computation. As this computation will be performed many times during search, we choose to first compile the BN into an arithmetic circuit (AC). Computing the probability over the circuit is polynomial in its size (which may be worst-case exponential in the size of the original network) [8]. Nevertheless, using ACs is generally recognized as one of the most effective techniques for exact computation of probabilities, especially when doing so repeatedly.

Fig. 3. Left: the Bayesian network and CPTs for a distribution with 3 variables. The domain for all variables is \(\{1,2\}\). Right: the compiled arithmetic circuit for this BN. A \(\lambda _{i,j}\) leaf in the AC corresponds to the assignment \(X_i = j\) in the BN.

Figure 3 shows an example AC. It consists of product nodes, sum nodes, constants and indicator variables \(\lambda \). The Boolean indicator variables \(\lambda _{i,j}\) indicate whether \((X_i = j)\). An assignment sets the corresponding indicator variable to 1 and the other indicator variables of the random variable to 0. To marginalize a random variable away, all indicator variables of that variable must be set to 1. The value obtained in the root node after evaluating the arithmetic circuit corresponds to the probability of the pattern given the assignment to the indicator variable nodes.

To encode this in CP, we will create a float variable for every node in the AC. Listing 12 shows how to encode an AC in CP. We assume we are given the sets of product and sum node identifiers (the root node has identifier 1), an array of sets representing the ’children’ relation, an array with the constants, and a 2D array that maps the indicator variables to nodes of the tree. The last two constraints in the listing provide a mapping from the Q variables to indicator variables, such that these must either all be set to 1 (= marginalize) or have exactly one indicator set to 1 (= assignment).

Listing 12. Encoding an arithmetic circuit in CP.
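A sketch of the encoding (all parameter names are assumptions, and b2f is assumed to come from the MiningZinc library):

    include "mining.mzn";                          % for b2f
    int: NrNodes;                                  % node 1 is the root
    set of int: NODES = 1..NrNodes;
    set of int: ProdNodes; set of int: SumNodes; set of int: ConstNodes;
    array [NODES] of set of NODES: Children;       % children relation
    array [NODES] of float: Const;                 % values of constant leaves
    int: NrVars; int: MaxDom;
    array [1..NrVars] of int: Dom;                 % domain size per variable
    array [1..NrVars, 1..MaxDom] of int: Ind;      % node id of lambda_{i,j}
    float: Alpha;

    array [NODES] of var 0.0..1.0: Val;            % one float var per AC node
    array [1..NrVars] of var 0..MaxDom: Q;         % the pattern; 0 = marginalize

    constraint forall (i in 1..NrVars) (Q[i] <= Dom[i]);
    constraint forall (n in ProdNodes)
      (Val[n] = product (c in Children[n]) (Val[c]));
    constraint forall (n in SumNodes)
      (Val[n] = sum (c in Children[n]) (Val[c]));
    constraint forall (n in ConstNodes) (Val[n] = Const[n]);
    % link Q to the indicator leaves: all indicators of X_i at 1
    % (marginalize) or exactly the chosen one at 1 (assignment)
    constraint forall (i in 1..NrVars, j in 1..Dom[i])
      (Val[Ind[i,j]] = b2f(Q[i] = 0 \/ Q[i] = j));

    constraint Val[1] > Alpha;                     % probable BN pattern
    solve satisfy;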

Many constraints from the itemset mining literature have a counterpart over BN patterns and can be formulated as well, such as size constraints, closed/maximal/free constraints and constraints to discriminate the results of two networks from each other, or to discriminate a probability in a network from a relative frequency in a dataset. This is currently work in progress.

3.4 Linear Regression

A common problem studied in data mining is regression, where the goal is to estimate the value of a target variable based on the values of other variables.

Problem Statement. In simple regression, the goal is to predict the value of a target attribute given the value of an M-dimensional vector of input variables \(x = (x_{1}, ..., x_{M})\).

We are given a set of N observations \(\mathbf {X}\) and their corresponding target values y. The goal is to find a function \(\hat{y}(x)\) that approximates the target values y for the given observations. In linear regression [3] we assume \(\hat{y}(x)\) to be a linear function. Mathematically, such a function can be formulated as

$$\begin{aligned} \hat{y}(x) = w_1 x_1 + ... + w_M x_M + w_{M+1} \end{aligned}$$

where \(w = (w_1, ..., w_{M+1})\) is a vector of weights that minimizes a given error function. This error function is typically defined as the sum of squared errors

$$\begin{aligned} sumSqErr (\hat{y}, y) = \Vert \mathbf {X}w - y \Vert ^2_2, \end{aligned}$$

where \(\mathbf {X}\) is an \(N \times (M+1)\) matrix where each row corresponds to a given observation (extended with the constant 1), and y is a vector of length N containing the corresponding target values. The vector \(\mathbf {X}w\) contains the result of computing \(\hat{y}\) for each observation. The goal is then to find the vector of weights w that minimizes this error, that is,

$$\begin{aligned} \arg \min _w \Vert \mathbf {X}w - y \Vert ^2_2. \end{aligned}$$

MiningZinc Model. We can formulate this problem as an optimization problem as shown in Listing 13. The model starts with defining the input data (Lines 2 and 3) and its dimensions (Lines 5–6). The input data can be specified in an external file. Line 9 specifies the weight vector that needs to be found. Based on this weight vector and the input variable x, we specify the estimate (\(\hat{y}(x)\)) of the linear model in Line 11. Finally, on Line 13 we specify the optimization criterion, i.e. to minimize the sum of squared errors. The function sumSqErr is defined in the MiningZinc library (Listing 3).

Listing 13. Linear regression.
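A sketch of the model, with comments marking the lines the text refers to (the library filename is an assumption):

    include "mining.mzn";                         % for sumSqErr
    array [int, int] of float: X;                 % line 2: observations
    array [int] of float: Y;                      % line 3: target values

    int: N = card(index_set_1of2(X));             % line 5: nr. of observations
    int: M = card(index_set_2of2(X));             % line 6: nr. of input vars

    array [1..M+1] of var float: W;               % line 9: the weight vector

    array [1..N] of var float: Yhat =             % line 11: linear estimate
      [ sum (j in 1..M) (W[j] * X[i,j]) + W[M+1] | i in 1..N ];

    solve minimize sumSqErr(Yhat, Y);             % line 13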

By replacing the error function we can easily model other linear regression tasks, for example linear regression with elastic net regularization [38], where the optimization criterion is defined as

$$\begin{aligned} \arg \min _w \frac{1}{2 n_{\text {samples}}} \Vert \mathbf {X}w - y \Vert ^2_2 + \alpha \rho \Vert w \Vert _1 + \frac{\alpha (1-\rho )}{2} \Vert w \Vert ^2_2 \end{aligned}$$

with \(\alpha \) and \(\rho \) parameters that determine the trade-off between L1 and L2 regularization. Listing 14 shows the implementation of this scoring function in MiniZinc.

Listing 14. The elastic net error function.
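A sketch of such a scoring function (the function name is an assumption, and sumSqErr is the library function from Listing 3):

    function var float: elasticNet(array[int] of var float: Yhat,
                                   array[int] of float: Y,
                                   array[int] of var float: W,
                                   float: Alpha, float: Rho) =
      sumSqErr(Yhat, Y) / (2.0 * int2float(length(Y)))
      + Alpha * Rho * sum (j in index_set(W)) (abs(W[j]))
      + Alpha * (1.0 - Rho) / 2.0
        * sum (j in index_set(W)) (W[j] * W[j]);

The solve item of Listing 13 would then become solve minimize elasticNet(Yhat, Y, W, Alpha, Rho).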

3.5 Clustering

The task of clustering is discussed in the next chapter. We would like to point out, however, that the clustering problems explained there can be modeled in MiningZinc too.

Problem Statement. Let us consider the minimum sum of squared error clustering problem (which k-means provides an approximation of), where the goal is to group all examples into k non-overlapping groups [21]. The objective to minimize is the ’error’ of each cluster, that is, the distance of each point in the cluster to the mean (centroid) of that cluster.

The centroid of a cluster can be computed by computing the mean of the data points that belong to it:

$$\begin{aligned} z_C = \text {mean}(C) = \dfrac{ \sum _{p \in C} p }{|C|} \end{aligned}$$
(2)

The error is then measured as the sum of squared errors of the clusters:

$$\begin{aligned} \sum _{C \in \mathcal{C}} \, \sum _{p \in C} d^2(p, z_C) \end{aligned}$$
(3)

MiningZinc Model. The model below shows a MiningZinc specification of this problem. As variables, it uses an array of integers, one integer variable for every example. This variable indicates which cluster the example belongs to. The optimization criterion is specified over all clusters and examples; the b2f(Belongs[j]==i) part converts the Boolean valuation of whether point j belongs to cluster i into a float variable, such that this indicator variable can be multiplied by the sum of squared errors of point j to cluster i.

The functions b2f() (bool to float) and sumSqErr() (sum of squared errors) are part of the MiningZinc library, see Sect. 2.2. The definition of mean() is shown below and follows the mathematical definition above.

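A sketch of the model and of mean(); the extra arguments to mean() are added here so that the sketch is self-contained, and it assumes no cluster is left empty:

    include "mining.mzn";                        % for b2f and sumSqErr
    int: K;                                      % number of clusters
    array [int, int] of float: Data;             % one row per example
    set of int: Points = index_set_1of2(Data);
    set of int: Dims   = index_set_2of2(Data);

    array [Points] of var 1..K: Belongs;         % cluster of each example

    solve minimize
      sum (i in 1..K, j in Points)
        (b2f(Belongs[j] == i)
         * sumSqErr([Data[j,d] | d in Dims],
                    [mean(i, d, Belongs, Data) | d in Dims]));

    % mean of dimension d over the points assigned to cluster i
    function var float: mean(int: i, int: d,
                             array[int] of var int: B,
                             array[int,int] of float: D) =
      sum (j in index_set(B)) (b2f(B[j] == i) * D[j,d])
      / sum (j in index_set(B)) (b2f(B[j] == i));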

More clustering problems and how to model them for use with constraint solvers can be found in the next chapter.

3.6 Relational Data Factorization

Motivated by an analogy with Boolean matrix factorization [29] (cf. Fig. 4), [9] introduces the problem of factorizing a relation in a database. In matrix factorization, one is given a matrix and has to factorize it as a product of other matrices.

Fig. 4.
figure 4

Boolean matrix factorization.

Problem Statement. In relational data factorization (RDF), the task is to factorize a given relation as a conjunctive query over other relations, i.e., as a combination of natural join operations. The problem is then to compute the extensions of these relations starting from the input relation and a conjunctive query. Relational data factorization is thus a form of both inverse querying and abduction, as one has to compute the relations in the query from the result of the query. The result of relational data factorization is not necessarily unique; constraints on the desired factorization can be imposed, and a scoring function is used to determine the quality of a factorization. Relational data factorization is therefore a constraint satisfaction and optimization problem.

More specifically, relational data factorization is a generalization of abductive reasoning [10]:

  1. instead of working with a single observation f, we now assume a set of facts D for a unique target predicate db is given;

  2. instead of assuming any definition for the target predicate, we assume a single definite rule defines db in terms of a set of abducibles A, the conjunctive query;

  3. instead of assuming that a minimal set of facts be abduced, we score the different solutions based on the observed error.

Formally we can specify the relational factorization problem as follows:

Given:

  • a dataset D (of ground facts for a target predicate db);

  • a factorization shape Q: \(\textit{db}(\bar{T}) \leftarrow q_1(\bar{T}_1), \dots , q_k(\bar{T}_k)\), where some of the \(q_i\) are abducibles;

  • a set of rules and constraints P;

  • a scoring function \(\textit{opt}\).

Find: the set of ground facts F, the extensions of relation Q, that scores best w.r.t. \(\textit{opt}(D,\textit{approx}(P,Q,F))\) and for which \(Q \cup P \cup F\) is consistent.

MiningZinc Model. Listing 15 shows the model for a relational factorization problem with a ternary conjunctive query, using the sum of absolute errors as scoring function and without additional constraints.

Listing 15. Relational data factorization.
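A sketch of such a model, where the concrete shape db(a,b,c) <- p(a,b) /\ q(b,c) and the 0/1 matrix encoding of the relations are assumptions:

    int: NA; int: NB; int: NC;
    array [1..NA, 1..NB, 1..NC] of 0..1: DB;       % the target relation db

    array [1..NA, 1..NB] of var bool: P;           % extension of p (abduced)
    array [1..NB, 1..NC] of var bool: Q;           % extension of q (abduced)

    solve minimize                                 % sum of absolute errors
      sum (a in 1..NA, b in 1..NB, c in 1..NC)
        (abs(DB[a,b,c] - bool2int(P[a,b] /\ Q[b,c])));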

3.7 Ranked Tiling

Ranked tiling was introduced in [23] to find interesting areas in ranked data. In this data, each transaction defines a complete ranking of the columns. Ranked data occurs naturally in applications like sports or other competitions. It is also a useful abstraction when dealing with numeric data in which the rows are incomparable.

Problem Statement. Ranked tiling discovers regions that have high average rank scores in rank matrices. These regions are called ranked tiles. Formally, a ranked tile is defined by the following optimization problem:

Problem 1

(Maximal ranked tile mining). Given a rank matrix \(\mathcal {M} \in \sigma ^{m \times n}\), with \(\sigma = \{1, \ldots , n\}\), and a threshold \(\theta \), find the ranked tile \(B=(R^{*},C^{*})\), with \(R^{*} \subseteq \{1 \ldots m\}\) and \(C^{*} \subseteq \{1 \ldots n\}\), such that:

$$\begin{aligned} B = (R^{*},C^{*}) = \mathrm {arg}\!\max _{R,C} \sum _{r \in R, c \in C}(\mathcal {M}_{r,c} - \theta ). \end{aligned}$$
(4)

where \(\theta \) is an absolute-valued threshold.

Example 1

Figure 5 depicts a rank matrix containing five rows and ten columns. When \(\theta = 5\), the maximal ranked tile is defined by \(R = \{1, 2, 3, 5\}\) and \(C = \{1, 2, 3\}\). The score obtained by this tile is 37, and no more columns or rows can be added without decreasing the score.

The maximal ranked tile mining problem aims to find a single tile, but we are also interested in finding a set of such tiles, as this maximizes the amount of information we can get from the data. In other words, we would like to discover a ranked tiling.

Problem 2

(Ranked tiling). Given a rank matrix \(\mathcal {M}\), a number k, a threshold \(\theta \), and a penalty term P, the ranked tiling problem is to find a set of ranked tiles \(B_i = (R_i,C_i),\ i=1 \ldots k\), such that they together maximize the following objective function:

$$\begin{aligned}&\mathrm {arg}\!\max _{R_i,C_i} \sum _{r \in \mathcal{R}, c \in \mathcal{C}} \mathbbm {1}_{(t_{r,c}\ge 1)}((\mathcal {M}_{r,c} - \theta ) - (t_{r,c}-1) P) \end{aligned}$$
(5)

where \(t_{r,c} = | \{ i\in \{1,\ldots ,k\} \mid r \in R_i, c\in C_i \} |\) indicates the number of tiles that cover a cell, and \(\mathbbm {1}_\varphi \) is an indicator function that returns 1 if the test \(\varphi \) is true, and 0 otherwise. P indicates a penalty that is assigned when tiles overlap.

To solve Problem 1 efficiently, we introduce two Boolean decision vectors: \(T = (T_1, T_2, ..., T_m)\), with \(T_i \in \{0, 1\}\), for the rows, and \(I = (I_1, I_2, ..., I_n)\), with \(I_i \in \{0, 1\}\), for the columns. An assignment to the Boolean vectors T and I indicates which rows and columns belong to the tile. The maximal ranked tile can then be found by solving the following equivalent problem:

Fig. 5. Example rank matrix, with maximal ranked tile \(B = (\{R1,R2,R3,R5\},\{C1,C2,C3\})\).

$$\begin{aligned} \mathrm {arg}\!\max _{T,I} \sum _{t \in \mathcal {R}} T_t \cdot \Big ( \sum _{i \in \mathcal {C}} (\mathcal {M}_{t,i} - \theta ) \cdot I_i \Big ) \end{aligned}$$
(6)

subject to

$$\begin{aligned} \forall t \in \mathcal {R}: T_t = 1 \leftrightarrow \sum _{i \in \mathcal {C}} (\mathcal {M}_{t,i} - \theta ) \cdot I_i \ge 0 \end{aligned}$$
(7)
$$\begin{aligned} \forall i \in \mathcal {C}: I_i = 1 \leftrightarrow \sum _{t \in \mathcal {R}} (\mathcal {M}_{t,i} - \theta ) \cdot T_t \ge 0 \end{aligned}$$
(8)

where redundant constraints (7), (8) are introduced to boost the search.

MiningZinc Model for Finding a Single Tile. This problem specification translates directly into the MiningZinc model shown in Listing 16 where Eqs. 7 and 8 correspond to lines 7 and 8, respectively, and the optimization criterion of Eq. 6 corresponds to line 9.

Listing 16. Maximal ranked tile mining.
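A sketch that transcribes Eqs. 6–8 directly (parameter names are assumptions):

    int: NrR; int: NrC; int: Theta;
    array [1..NrR, 1..NrC] of int: M;              % the rank matrix

    array [1..NrR] of var bool: T;                 % selected rows
    array [1..NrC] of var bool: I;                 % selected columns

    constraint forall (t in 1..NrR)                % Eq. 7 (line 7)
      (T[t] <-> sum (i in 1..NrC) ((M[t,i] - Theta) * bool2int(I[i])) >= 0);
    constraint forall (i in 1..NrC)                % Eq. 8 (line 8)
      (I[i] <-> sum (t in 1..NrR) ((M[t,i] - Theta) * bool2int(T[t])) >= 0);

    solve maximize                                 % Eq. 6 (line 9)
      sum (t in 1..NrR, i in 1..NrC)
        ((M[t,i] - Theta) * bool2int(T[t]) * bool2int(I[i]));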

To solve Problem 2, Le Van et al. [23] propose to approximate the optimal solution by using a greedy approach, as is common for this type of pattern set mining problem. The first tile is found by solving the optimization problem in Listing 16. Next, we remove that tile by setting all cells in the matrix that are covered to the lowest rank (or another value, depending on parameter P). Then, we search in the resulting matrix for the second tile. This process is repeated until the desired number of tiles is found. The sum of the scores of all discovered tiles will correspond to the score of Eq. 5 for this solution. However, as the search is greedy, the solution is not necessarily optimal.

Python Wrapper for Greedy Tile Mining. The greedy approach cannot be modeled directly in the MiningZinc language. However, the MiningZinc framework allows direct access to the solving infrastructure from Python. The complete algorithm is shown in Listing 17. The interaction with the MiningZinc module (mngzn) occurs on line 7, where the model is initialized, and on line 9, where one solution of the model is retrieved. In lines 13 through 19 the obtained tile is removed from the original matrix (by setting its entries to 0). The process is repeated until the tile no longer covers a new entry of the matrix.

Listing 17. Greedy ranked tiling in Python.
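A sketch of the greedy loop (the mngzn function names follow the sketch in Sect. 2.4 and remain assumptions, as does the shape of the returned solutions; the line numbers cited above refer to the original listing, not to this sketch):

    import mngzn

    def greedy_ranked_tiling(matrix, theta, model_str):
        """Greedily find tiles, blanking each one out of the matrix."""
        tiles = []
        while True:
            params = {'M': matrix, 'Theta': theta,
                      'NrR': len(matrix), 'NrC': len(matrix[0])}
            model = mngzn.parseModel(model_str, params)  # initialize the model
            sol = next(iter(model.solve()))              # retrieve one solution
            rows = [r for r in range(len(matrix)) if sol['T'][r]]
            cols = [c for c in range(len(matrix[0])) if sol['I'][c]]
            # stop once the tile covers no entry that is still uncovered
            if all(matrix[r][c] == 0 for r in rows for c in cols):
                break
            tiles.append((rows, cols))
            for r in rows:                               # remove the tile
                for c in cols:
                    matrix[r][c] = 0
        return tiles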

4 Related Work

We have shown how MiningZinc can be used to model a wide variety of data mining and machine learning tasks in a high-level and declarative way. Our modeling language is based on MiniZinc [32] because it is a well-developed existing language with wide support in the CP community, it supports user-defined constraints, and it is solver-independent. Other modeling languages such as Essence [12], Comet [37] and OPL [36] have no, or only limited, support for building libraries of user-defined constraints, and/or are tied to a specific solver.

Integrating declarative modeling and data mining has been studied before in the case of itemset mining [16, 22], clustering [11, 27] and sequence mining [31]. However, these approaches were low-level and solver-dependent. The use of higher-level modeling languages and primitives has been studied before [17, 28], though again tied to one particular solving technology.

The idea of combining multiple types of data mining and machine learning techniques also lies at the basis of machine learning packages such as WEKA [18] and scikit-learn [33]. However, these packages do not offer a unified declarative language and they do not support going beyond the capabilities of the algorithms offered.

In data mining, our work is related to that on inductive databases [24]; these are databases in which both data and patterns can be queried. Most inductive query languages, e.g., [20, 26], extend SQL with primitives for pattern mining. They have only a restricted language for expressing mining problems, and are usually tied to one mining algorithm. A more advanced development is that of mining views [4], which provides lazy access to patterns through a virtual table. Standard SQL can be used for querying, and the implementation will only materialize those patterns in the table that are relevant for the query. This is realized using a traditional mining algorithm. MiningZinc, in contrast, supports the integration of data from an external database directly through SQL queries.

5 Solving

This chapter does not expand on solving, but the MiningZinc framework [14] supports three types of solving: (1) using an existing MiniZinc solver; (2) detecting that the specified task is a standard known task and using a specialised algorithm to solve it; and (3) a hybrid approach that uses both specialised algorithms and generic constraint solvers, for example by solving a master problem and subproblem with different technologies, or by incorporating specialised algorithms inside global constraint propagators. The first approach is typically the least efficient but the most flexible towards adding extra constraints. The second approach is the least flexible but typically the most scalable. The third, hybrid, approach offers a trade-off between generality and efficiency, but requires modifications to the solving process, which is hence beyond what can be expressed in a modeling language like Mini(ng)Zinc.

6 Conclusion

In this chapter we showed how a wide range of data mining problems can be modeled in MiningZinc. Only a minimal library of extra predicates and functions was needed to express these problems, meaning that standard MiniZinc is often sufficient to model such problems. The two main additions are the ability to load data from a database, and a library of distance functions, which are often used in data mining.

The key feature of MiningZinc as a language for expressing data mining problems is the ability to add and modify constraints and objective functions. Hence, constraint-based mining problems are those where the language and framework have the most to offer, such as constraint-based pattern mining and constrained clustering. Another valuable use is the prototyping of new data mining problems, as was done for relational data factorization and ranked tiling. Many other problem settings are yet unexplored.