1 Introduction

Logical Analysis of Data (LAD) is a supervised learning methodology based on combinatorial optimization that has proven useful across many disciplines (e.g., [1,2,3,4,5,6,7, 12, 18, 20,21,22,23, 27]). A key and bottleneck step in LAD is pattern generation, where a set of features and their negations are optimally combined to form rules that distinguish one type of observation from the other(s). Without loss of generality, we consider the analysis of two types of data, \(+\) and −, and denote by \(S^\bullet \) the index set of data of type \(\bullet \) for \(\bullet \in \{+, -\}\). Let \(S = S^+ \cup S^-\). We assume S is duplicate and contradiction free (so that \(S^+\cap S^- = \emptyset \)) and that the data under analysis are described by n Boolean attributes \(a_j\), \(j \in N {{:=}} \{1, \ldots , n\}\). We let \(a_{n+j} = \lnot {a}_j\) for \(j \in N\) and let \(N^\prime {:=} \{n+1, \ldots , 2n\}\) and \(\mathscr {N} {:=} N \cup N^\prime \). For each observation \(A_i\), \(i \in S\), we denote by \(A_{ij}\) its j-th attribute value, so that \(A_{ij} = 1-A_{i,n+j}\) for \(j \in N\) and \(A_{ij} = 1-A_{i,j-n}\) for \(j \in N^\prime \). Finally, since \(+\) and − patterns are symmetric in definition, we present the material below in the context of \(+\) pattern generation for convenience, without loss of generality.

To build a mathematical model for pattern generation, we introduce 0–1 indicator variables \(x_j\) for \(j \in \mathscr {N}\) and let

$$\begin{aligned} x_j = {\left\{ \begin{array}{ll} 1, &{} \text {if attribute} \,a_j\, \text {is involved in a pattern; and} \\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

For \(i\in S\), we let

$$\begin{aligned} J_{i} {:=} \{j \in \mathscr {N} \, | \, A_{ij} = 0\}. \end{aligned}$$

Since the dataset is duplicate and contradiction free, all \(J_i\)’s are distinct. Moreover, it is easy to see that \(|J_i| = n, \forall i \in S\).
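To make the notation concrete, the following Python sketch (our own illustration; the observation is made up) computes \(J_i\) for a row given in its n original attributes and confirms that \(|J_i| = n\):

```python
# Sketch (our own illustration): J_i over the 2n extended attributes of one row.
def J(row):
    """J_i = { j in {1, ..., 2n} : A_ij = 0 } for a 0-1 row over n attributes."""
    ext = list(row) + [1 - v for v in row]                # a_{n+j} = not a_j
    return {j + 1 for j, v in enumerate(ext) if v == 0}

A1 = (0, 1, 1, 1, 0)                                      # an arbitrary observation, n = 5
assert J(A1) == {1, 5, 7, 8, 9}
assert len(J(A1)) == len(A1)                              # |J_i| = n for every row
```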

In [31], we showed that the 0–1 MP below provides a unifying framework for LAD pattern generation:

$$\begin{aligned} \text{(PG) } \; : \;\; \max \left\{ \, \varphi ^+ (\mathbf {x}) + l(\mathbf {x}) \; \left| \; \varphi ^- (\mathbf {x}) = 0, \; \mathbf {x} \in \{0,1\}^{2n} \, \right. \right\} \end{aligned}$$

where \(l(\mathbf {x})\) is a linear function and

$$\begin{aligned} \varphi ^\bullet (\mathbf {x}) = \displaystyle \sum _{i\in S^\bullet }\prod _{j\in J_{i}} (1-x_j) \end{aligned}$$

for \(\bullet \in \{+, -\}\).

It is well-known that the constraint of (PG) is equivalent to a set of minimal cover inequalities [17]:

$$\begin{aligned} \displaystyle \sum _{i\in S^-}\prod _{j\in J_{i}} (1-x_j) = 0 \quad \Longleftrightarrow \quad \sum _{j \in J_i} x_j \ge 1, \;\; i \in S^- \end{aligned}$$

The minimal cover inequalities provide a poor linear programming (LP) relaxation bound, however. In [32], we studied a convex underestimation of \(\varphi ^-\) whose terms correspond to the − observations of the dataset. To this end, we analyzed the − data in a graph and discovered useful neighborhood properties among a set of data (forming a hypercube in the graph) and among a set of sets of data (forming a set of pairwise neighboring hypercubes in the graph) that allow for a ‘compact’ convexification of \(\varphi ^-\) in terms of a smaller number of 0–1 linear inequalities that dominate those obtained for \(\varphi ^-\) via methods from the literature [9, 17]. [32] also studied the Boolean multilinear polytope that results from the new valid inequalities and identified cases where the new inequalities are facet-defining.

Consider now the 0–1 linearization of \(\varphi ^+\). Noting that \(\varphi ^+\) is in maximization form, one only needs a (piecewise) concave overestimating function to solve (PG) by techniques for 0–1 mixed integer and linear programming (MIP). For this task, the McCormick concave envelope for a 0–1 monomial serves the purpose (e.g., [13, 26, 28, 30]) at the expense of introducing \(m^+\) (where \(m^+ = |S^+|\)) variables

$$\begin{aligned} y_i = \prod _{j \in J_i} (1 - x_j), \;\; i \in S^+ \end{aligned}$$
(1)

and \(n \times m^+\) inequalities

$$\begin{aligned} y_i \le 1 - x_j, \;\; j \in J_i,\; i \in S^+ \end{aligned}$$
(2)

to the formulation of a 0–1 linear relaxation of (PG). Alternatively, one may aggregate the constraints in (2) with respect to j to concavify \(\varphi ^+\) by \(m^+\) valid inequalities (e.g., [29])

$$\begin{aligned} ny_i + \sum _{j \in J_i} x_j \le n, \;\; i \in S^+, \end{aligned}$$

but the result is a weaker relaxation of \(\varphi ^+\). In addition, this relaxation requires the \(m^+\) additional \(y_i\)’s to be 0–1 integers. The constraints in (2) can also be aggregated with respect to i via standard probing techniques and logical implications in integer programming (e.g., [14,15,16]). This yields

$$\begin{aligned} \sum _{i \in I_j} y_i \le | I_j | (1 - x_j), \;\; j \in \mathscr {N}, \end{aligned}$$

where \(I_j {:=} \{i \in S^+ \, | \, j \in J_i\}\) for \(j \in \mathscr {N}\). Although a weaker relaxation of \(\varphi ^+\) than the standard McCormick relaxation above, this method reduces the number of constraints by a factor of \(m^+ / 2\), and thus can be useful in data mining applications (e.g., see [32]). Last, a method based on ‘mapping’ from [8] replaces each \(y_i\) in \(\sum _{i \in S^+} y_i\) by some \(1 - x_j\) for \(j \in J_i\). As which j to use is unknown, a naive implementation of this mapping method introduces up to \((m^+)^n\) upper bounding functions, so its most efficient form of implementation may well be the standard relaxation.
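The trade-off among the three relaxations above is easiest to see by counting constraints. The sketch below is our own illustration (the toy \(+\) observations and the plain-string encoding of inequalities are assumptions, not the paper's implementation):

```python
# Our own size comparison of the three concave relaxations of phi+ (toy data;
# inequalities encoded as plain strings for illustration only).
def J(row):
    ext = list(row) + [1 - v for v in row]
    return {j + 1 for j, v in enumerate(ext) if v == 0}

pos = [(0, 1, 1, 1, 0), (1, 1, 1, 0, 0)]                  # hypothetical + observations
n, m_pos = len(pos[0]), len(pos)

# (2): McCormick envelope, n inequalities per + observation
mccormick = [f"y{i+1} <= 1 - x{j}" for i, r in enumerate(pos) for j in sorted(J(r))]
# aggregation over j: one inequality per + observation, y_i forced integer
agg_by_j = [f"{n}*y{i+1} + sum(x_j : j in J{i+1}) <= {n}" for i in range(m_pos)]
# aggregation over i (probing): at most one inequality per extended attribute
I = {j: [i + 1 for i, r in enumerate(pos) if j in J(r)] for j in range(1, 2 * n + 1)}
agg_by_i = [f"sum(y_i : i in {I[j]}) <= {len(I[j])}*(1 - x{j})" for j in I if I[j]]

assert len(mccormick) == n * m_pos                        # n * m+ = 10
assert len(agg_by_j) == m_pos                             # m+ = 2
assert len(agg_by_i) <= 2 * n                             # at most 2n = 10
```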

Extending the line of research in [32], this paper aims to compactly 0–1 linearize \(\varphi ^+\) in terms of a smaller number of stronger linear inequalities based on a multi-term relaxation of the terms of \(\varphi ^+\). Toward this goal, we analyze \(+\) and − data simultaneously on a graph for useful properties between the terms of \(\varphi ^+\) and those of \(\varphi ^-\); that is, between a set of \(+\) data and a set of − data. More specifically, on a graph where individual data are represented as nodes, we introduce an edge between each pair of \(+\) and − nodes that are Hamming distance 1 apart in the n original attributes. Then, we show that each star in the graph (which, by construction, has a − datum as its internal node) simultaneously relaxes d terms of \(\varphi ^+\) to generate \(n+d\) valid inequalities for (PG) that dominate the \(n \times d\) McCormick inequalities from the d leaf node(s) of the star. In addition, we show that a collection of ‘neighboring’ stars generates a much smaller number of stronger valid inequalities for (PG), based on a multi-term relaxation of the \(+\) terms/data in the star set, that can substitute for the McCormick inequalities from the \(+\) nodes of the stars considered for a tighter polyhedral relaxation of (PG). Furthermore, we show that our inequalities are facet-defining for the 0–1 multilinear polytope associated with the McCormick inequalities that they replace. Put differently, the main results of this paper provide sufficient conditions and algebraic tools for automatically lifting and tightening McCormick inequalities to the fullest extent, in terms of both dimension and tightness. Mathematically and also through an illustrative example, we show that the maximum benefit, in regard to both the size and the tightness of the relaxation model, is realized when ‘a maximal set of maximal stars’ is exploited for generating new valid inequalities.

As for the organization of this paper, Sect. 2 first presents background material and then develops the aforementioned main results. Section 3 illustrates the construction and benefits of the new polyhedral concavification scheme for \(\varphi ^+\) with a small artificial dataset of 10 observations, while Sect. 4 demonstrates the practical utility of the main results through experiments with 12 public datasets from [24], in comparison with cutting plane methods implemented in CPLEX [19]. Finally, concluding remarks are provided in Sect. 5.

2 Main results

As in [32], we engage in a polyhedral analysis of the feasible region of (PG) via a graph-theoretical analysis of data for a compact and tighter 0–1 linearization of \(\varphi ^+\). The difference is that both \(+\) and − data are now analyzed on one graph for the discovery of useful neighborhood properties between a set of \(+\) data and a set of − data. To obtain a graph representation of a given dataset S, we consider each observation as a node and introduce an edge between a pair of \(+\) and − observations if they are Hamming distance 1 apart in the n original attributes. This yields an undirected graph G.

Let us recall the definition of a star in a graph (see Fig. 1) and follow it with the definition of the Hamming distance for measuring similarity between two Boolean data.

Definition 1

A star \(\mathfrak {S}\) is a tree with one internal/center node adjacent to all other, leaf nodes. The degree d of a star refers to the degree of the internal node; thus, the number of leaves/edges in it.

Fig. 1

Degree 4 star with internal node a and 4 leaves b, c, d, e

Definition 2

The Hamming distance \(H(\mathbf {x}, \mathbf {y})\) between two 0–1 vectors \(\mathbf {x}\) and \(\mathbf {y}\) is the number of positions in which they differ.
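For concreteness, the construction of the graph G described above can be sketched as follows (the data rows and the edge-list representation are our own hypothetical illustration):

```python
from itertools import product

def hamming(x, y):
    """Number of positions in which two equal-length 0-1 vectors differ."""
    return sum(a != b for a, b in zip(x, y))

# Hypothetical dataset: two + observations and one - observation over n = 5.
S_pos = [(0, 1, 1, 1, 0), (1, 1, 1, 0, 0)]
S_neg = [(0, 1, 1, 0, 0)]

# Edge set of G: one edge per (+, -) pair at Hamming distance 1
# in the n original attributes.
edges = [(i, l) for (i, Ai), (l, Al) in product(enumerate(S_pos), enumerate(S_neg))
         if hamming(Ai, Al) == 1]
```

Here both hypothetical \(+\) observations end up adjacent to the single − observation, so together they already form a degree 2 star.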

The following provides a means for comparing the relative strength of two valid inequalities for a set of linear inequalities.

Definition 3

If \(\pi x \le \pi _0\) and \(\mu x \le \mu _0\) are two valid inequalities for a polytope in the nonnegative orthant, \(\pi x \le \pi _0\) dominates \(\mu x \le \mu _0\) if there exists \(u>0\) such that \(u \mu \le \pi \) and \(\pi _0 \le u \mu _0\), and \((\pi , \pi _0) \not = (u \mu , u \mu _0)\).

Before proceeding, we summarize an important complementary relation between a pair of Boolean attributes/variables that is exploited in obtaining stronger valid inequalities for (PG).

Proposition 1

(Proposition 1 in [32]) For \(j \in \mathscr {N}\), let \(j^c = n+j\) if \(j \in N\) and \(j^c = j-n\) if \(j \in N^\prime \). Then, the complementarity cut

$$\begin{aligned} x_j + x_{j^c} \le 1, \end{aligned}$$
(3)

is valid for (PG).

Let:

$$\begin{aligned} \mathscr {I}_{PG} {:=} \left\{ \mathbf {x} \in \{0,1\}^{2n}, \, \mathbf {y} \in [0,1]^{m^+} \, \left| \, \varphi ^-(\mathbf {x}) = 0, \, (1), \, (3) \right. \right\} \end{aligned}$$

The following subsections deal with a tighter polyhedral relaxation of \(\mathscr {I}_{PG}\) via a smaller number of stronger valid inequalities and study the multilinear polytope associated with these inequalities.

2.1 Stronger valid inequalities from a star

We begin with a study of a linked pair of nodes in the graph (that is, a pair of \(+\) and − data that are Hamming distance 1 apart in the n original attributes) and then extend the idea to a set of nodes forming a (maximal) star. To build insight, we first consider an arbitrary pair of \(+\) and − data.

Proposition 2

For a pair of observations \(A_i, i \in S^+\), and \(A_\ell , \ell \in S^-\), the following inequality

$$\begin{aligned} y_i \le \sum _{j \in J_\ell \setminus J_i} x_j \end{aligned}$$
(4)

is valid for \(\mathscr {I}_{PG}\).

Proof

As noted earlier, the minimal cover inequality associated with \(A_\ell \) is:

$$\begin{aligned} \sum _{j \in J_\ell } x_j \ge 1 \end{aligned}$$
(5)

Note that \( | J_\ell \setminus J_i | \in \{ 1,\ldots , n \}\). Suppose \(| J_\ell \setminus J_i | = n\). Then, we have

$$\begin{aligned} \sum _{j \in J_\ell \setminus J_i} x_j = \sum _{j \in J_\ell } x_j \ge 1 \ge y_i. \end{aligned}$$

On the other hand, if \(| J_\ell \setminus J_i | \le n-1\), (2) can be rewritten for \(A_i\) as

$$\begin{aligned} x_j \le 1 - y_i, \quad j \in J_i \end{aligned}$$

and using this in (5) yields:

$$\begin{aligned} \left| J_i \cap J_\ell \right| (1 - y_i) + \sum _{j \in J_\ell \setminus J_i} x_j \ge 1 \end{aligned}$$

Now, with all \(x_j\) 0–1 integer variables, \(y_i\) is intrinsically binary, too. Hence \(1 - y_i \in \{0,1\}\), and the coefficient \(\left| J_i \cap J_\ell \right| \ge 1\) in the foregoing inequality can be tightened to 1, reducing it to

$$\begin{aligned} (1 - y_i) + \sum _{j \in J_\ell \setminus J_i} x_j \ge 1 \end{aligned}$$

and completes the proof. \(\square \)

An acute reader will note in the proof above that (4) becomes stronger as \(H(A_i, A_\ell )\) gets smaller. That is, the strongest inequality of form (4) results when a pair of \(+\) and − observations is at Hamming distance 1 in the original attributes, so that \(| J_\ell \setminus J_i | = 1\). By construction, each such pair is joined by an edge in our graph representation of the data, and the following result applies to every edge in G.

Lemma 1

For a pair of observations \(A_i, i \in S^+\), and \(A_\ell , \ell \in S^-\) with \(J_\ell \setminus J_i = \{ j \}\), the inequality

$$\begin{aligned} y_i \le x_j \end{aligned}$$
(6)

is valid for \(\mathscr {I}_{PG}\) and dominates \( y_i \le 1 - x_{j^c} \text { and } x_j \ge 0.\)

Proof

\(J_\ell \setminus J_i = \{ j \}\) indicates that \(A_{ij} = 1\) and \(A_{\ell j} = 0\) and \(J_i \setminus J_\ell = \{ j^c \}\), and the validity of (6) is immediate from Proposition 2. Now, (6), along with (3), yields

$$\begin{aligned} 0 \le y_i \le x_j \le 1 - x_{j^c} \end{aligned}$$

and completes the proof. \(\square \)

Remark 1

The strength of (6) primarily lies in that it eliminates the possibility \(x_j=x_{j^c} = 0\) and \(y_i=1\) once and for all. This is illustrated in Fig. 2, in comparison with its counterpart McCormick inequality. \(\square \)

Fig. 2

\(y_i \le 1 - x_{j^c}\) (McCormick) versus \(y_i \le x_j\) (Lemma 1)

Let us extend the result above to a (maximal) star \(\mathfrak {S}\) in G with a − observation \(A_{\ell }\) as the center node and a set \(U \, (U \subseteq S^+, |U| \ge 2)\) of \(+\) observations as its leaves. For notational simplicity, we let

$$\begin{aligned} \varDelta {:=} \bigcup _{i \in U} \{j \in N | A_{ij} \not = A_{\ell j}\} \end{aligned}$$

and, for each \(j \in \varDelta \), denote by \(i_j\) the index of the \(+\) observation in U that differs from \(A_\ell \) in the j-th position; that is, \(A_{i_j j} \not = A_{\ell j}\) and \(A_{i_j j'} = A_{\ell j'}, \forall j' \in N \setminus \{j\}\).
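The sets \(\varDelta \) and the map \(j \mapsto i_j\) are mechanical to compute; a minimal sketch, with hypothetical rows over \(n = 5\) original attributes:

```python
# Sketch of the star notation (hypothetical rows): Delta collects positions
# where some leaf differs from the center; i_of[j] names the unique leaf
# differing from the center at position j.
center = (0, 1, 1, 0, 0)                                  # a - observation A_l
leaves = {1: (0, 1, 1, 1, 0), 2: (1, 1, 1, 0, 0)}         # + observations at distance 1
Delta = {j + 1 for A in leaves.values()
         for j, (a, c) in enumerate(zip(A, center)) if a != c}
i_of = {j + 1: i for i, A in leaves.items()
        for j, (a, c) in enumerate(zip(A, center)) if a != c}
assert Delta == {1, 4} and i_of[4] == 1 and i_of[1] == 2
```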

Theorem 1

Consider a star \(\mathfrak {S}\) formed by one − observation \(A_\ell \) and a set \(U (U \subseteq S^+, d {=} |U| \ge 2)\) of \(+\) observations. Then, for each \( j \in J^\star {:=} \bigcap _{i \in U} J_i \cap J_\ell (\not = \emptyset )\), the inequality

$$\begin{aligned} \sum _{i \in U} y_i \le 1 - x_j \end{aligned}$$
(7)

is valid for \(\mathscr {I}_{PG}\) and dominates \(y_i \le 1 - x_j\) for all \(i \in U\).

Proof

Consider the McCormick concave envelope for each \(y_i, i \in U:\)

$$\begin{aligned} y_i \le 1 - x_j, \;\; j \in J_i \end{aligned}$$

It is easy to see that if \(x_j = 1\) for some \(j \in J^\star \), then the minimal cover inequality (5) from \(A_\ell \) is satisfied, and \(y_i = 0, \forall i \in U\). Hence, we have:

$$\begin{aligned} \sum _{i \in U} y_i = 0 \le 1 - x_j, \;\; j \in J^\star \end{aligned}$$

Now suppose \(x_j = 0, \forall j \in J^\star \). Without loss of generality, let \(A_{\ell j} = 0, \forall j \in \varDelta \). For some \(\iota \in \varDelta \), by setting \(x_\iota = 1\), (5) is satisfied, and \(y_i = 0, \forall i \in U \setminus \{i_\iota \}\) since \(A_{i \iota } = 0 \) (which implies \(y_i \le 1 - x_\iota \)), \(\forall i \in U \setminus \{i_\iota \}\). In addition, it follows from Lemma 1 that \( y_{i_\iota } \le x_\iota = 1. \) Thus,

$$\begin{aligned} \sum _{i \in U} y_i \le 1 = 1 - x_j, \;\; j \in J^\star . \end{aligned}$$

The dominance relation is immediate from \(y_i \le \sum _{i' \in U} y_{i'}\) for all \(i \in U\). \(\square \)

Fig. 3

Data for Example 1. a Observations under analysis; b a star \(\mathfrak {S}\) formed by the five observations

We briefly pause for an example to help the reader become acquainted with the notation of this paper and also to illustrate how the results thus far work.

Example 1

Consider the dataset in Fig. 3a, which forms the star \(\mathfrak {S}\) in Fig. 3b, where \(J_\ell \setminus J_i\) is indicated for each \(i \in \{1, \ldots , 4\}\). Applying the standard McCormick concave relaxation, we obtain the 20 inequalities below:

$$\begin{aligned}&y_1 \le 1 - x_1, \;\; y_1 \le 1 - x_5, \;\; y_1 \le 1 - x_7, \;\; y_1 \le 1 - x_8, \;\; y_1 \le 1 - x_9 \\&y_2 \le 1 - x_4, \;\; y_2 \le 1 - x_5, \;\; y_2 \le 1 - x_6, \;\; y_2 \le 1 - x_7, \;\; y_2 \le 1 - x_8 \\&y_3 \le 1 - x_1, \;\; y_3 \le 1 - x_3, \;\; y_3 \le 1 - x_4, \;\; y_3 \le 1 - x_5, \;\; y_3 \le 1 - x_7 \\&y_4 \le 1 - x_1, \;\; y_4 \le 1 - x_2, \;\; y_4 \le 1 - x_4, \;\; y_4 \le 1 - x_5, \;\; y_4 \le 1 - x_8 \end{aligned}$$

Applying Lemma 1 for \(A_1\) and \(A_5\), we obtain

$$\begin{aligned} y_1\le x_4 \end{aligned}$$

and this inequality dominates its counterpart McCormick inequality

$$\begin{aligned} y_1 \le 1 - x_9. \end{aligned}$$

Likewise, (6) in Lemma 1 yields

$$\begin{aligned} y_2 \le x_1, \;\; y_3 \le x_8 \; \text{ and } \; y_4 \le x_7 \end{aligned}$$

which, respectively, dominate

$$\begin{aligned} y_2 \le 1 - x_6, \;\; y_3 \le 1 - x_3 \; \text{ and } \; y_4 \le 1 - x_2. \end{aligned}$$

To apply the result in Theorem 1, we note that \(U = \{1, 2, 3, 4\}\) and \(J^\star = \{5\}\). Therefore, via (7), we obtain

$$\begin{aligned} y_1 + y_2 + y_3 + y_4 \le 1 - x_5. \end{aligned}$$

As seen, this inequality dominates the four inequalities below obtained by the standard linearization method:

$$\begin{aligned} y_1 \le 1 - x_5, \;\; y_2 \le 1 - x_5, \;\; y_3 \le 1 - x_5 \;\; \text{ and } \;\; y_4 \le 1 - x_5 \end{aligned}$$
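For readers who wish to check Theorem 1 computationally, the sketch below brute-forces inequality (7) over every point of \(\mathscr {I}_{PG}\) for one 0–1 dataset consistent with the sets \(J_i\) and \(J_\ell \setminus J_i\) reported above (the explicit rows are our own reconstruction, not copied from Fig. 3):

```python
from itertools import product

# One 0-1 dataset consistent with Example 1 (rows are our reconstruction from
# the reported sets J_i and J_5 \ J_i); A_5 is the - observation, n = 5.
rows = {1: (0, 1, 1, 1, 0), 2: (1, 1, 1, 0, 0), 3: (0, 1, 0, 0, 0),
        4: (0, 0, 1, 0, 0), 5: (0, 1, 1, 0, 0)}
n = 5

def J(r):
    ext = list(r) + [1 - v for v in r]                    # append negated attributes
    return {j + 1 for j, v in enumerate(ext) if v == 0}

for x in product((0, 1), repeat=2 * n):
    if any(x[j - 1] + x[j + n - 1] > 1 for j in range(1, n + 1)):
        continue                                          # violates complementarity (3)
    if all(x[j - 1] == 0 for j in J(rows[5])):
        continue                                          # violates the cover cut from A_5
    y = {i: int(all(x[j - 1] == 0 for j in J(rows[i]))) for i in range(1, 5)}
    assert y[1] + y[2] + y[3] + y[4] <= 1 - x[4]          # (7): sum of y_i <= 1 - x_5
```

The loop enumerates all \(2^{10}\) binary vectors and never trips the assertion, in line with Theorem 1 for \(J^\star = \{5\}\).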

Remark 2

The inequality in (7) resembles the inequality \(\sum _{i \in I_j} y_i \le | I_j | (1 - x_j)\) that is obtained for \(j \in \mathscr {N}\) via a simple aggregation based on standard probing techniques and logical implications in integer programming (e.g., [14,15,16]). The difference is that the former involves fewer \(y_i\)’s but has 1 as the coefficient of \(1- x_j\) on the right-hand side. An acute reader may note that (7) results when the individual McCormick/standard inequalities from the \(+\) leaf nodes of the star are lifted and tightened to the fullest extent. This insight explains why the new inequality from a star is stronger than the individual inequalities used for obtaining it, as illustrated in the preceding example. \(\square \)

When a leaf node of a star in Theorem 1 is purposefully deleted, we obtain a valid inequality in one fewer y variable and a different x variable than (7) that is not dominated by (7) and that dominates the set of McCormick inequalities from the remaining leaf nodes (data) of the reduced star. We summarize this result next.

Corollary 1

Consider a star described in Theorem 1 with \( d \ge 3\). For each \(\iota \in \varDelta \), let \(J_\ell \setminus J_{i_\iota } = \{j\}\). Then, the inequality

$$\begin{aligned} \sum _{i \in U \setminus \{ i_\iota \}} y_i \le 1 - x_j \end{aligned}$$
(8)

is valid for \(\mathscr {I}_{PG}\) and dominates \(y_i \le 1 - x_j\) for all \(i \in U \setminus \{i_\iota \}\).

Proof

Consider the star obtained from \(\mathfrak {S}\) by removing a leaf \(A_{i_\iota }\). The result is immediate from Theorem 1. \(\square \)

A few remarks are due.

Remark 3

In Corollary 1, \(j = \iota \) if \(A_{\ell \iota } = 0\); while \(j = \iota + n\) if \(A_{\ell \iota } = 1\) (refer Fig. 3). \(\square \)

Remark 4

Note that if \(d=1\) in Theorem 1 or \(d=2\) in Corollary 1, (7) and (8) reduce to inequalities given by the standard McCormick relaxation method, respectively. The reader can verify that these two inequalities are strongest when a maximal star is considered for their construction; the result involves the maximum number of y variables on the left-hand side.

Furthermore, note for any star \(\mathfrak {S}\) that \(\varDelta \) is fixed. Therefore, removing more than a single observation from U, as done in Corollary 1, produces a valid inequality with fewer \(y_i\)’s on the left-hand side than (8), which is thus dominated by the latter. \(\square \)

Example 2

(Illustration of Corollary 1) Let us recall the data in Fig. 3. Note that \(\varDelta = \bigcup _{i \in U} \{j \in N | A_{ij} \not = A_{5 j}\} = \{1,2,3,4\}\). For \(\iota = 1\), we have \(i_1 = 2\), thus \(J_5 \setminus J_2 = \{1\}\). So, we obtain via (8)

$$\begin{aligned} y_1 + y_3 + y_4 \le 1 - x_1 \end{aligned}$$

which dominates

$$\begin{aligned} y_1 \le 1 - x_1, \;\; y_3 \le 1 - x_1 \;\; \text{ and } \;\; y_4 \le 1 - x_1. \end{aligned}$$

In the same fashion, one uses Corollary 1 for other \(\iota \)’s of \(\varDelta \) to obtain

$$\begin{aligned} y_1 + y_2 + y_3 \le 1 - x_7, \;\; y_1 + y_2 + y_4 \le 1 - x_8 \;\; \text{ and } \;\; y_2 + y_3 + y_4 \le 1 - x_4 \end{aligned}$$

which dominate the set of McCormick inequalities involving \(x_7\), \(x_8\) and \(x_4\), respectively.

In summary, for analyzing the dataset in Fig. 3, one can use the 9 valid inequalities in Example 1 and above in place of the 20 McCormick inequalities to obtain a tighter polyhedral overestimation of \(\varphi ^+\). \(\square \)
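The bookkeeping in Examples 1 and 2 can be automated. The sketch below derives all 9 replacement inequalities from one 0–1 reconstruction of the Fig. 3 data (the explicit rows are our assumption, chosen to match the sets \(J_5 \setminus J_i\) reported in Example 1):

```python
# One 0-1 reconstruction of the Fig. 3 data (rows are our assumption); n = 5,
# A_5 = center is the - observation, A_1..A_4 are the + leaves of the star.
rows = {1: (0, 1, 1, 1, 0), 2: (1, 1, 1, 0, 0),
        3: (0, 1, 0, 0, 0), 4: (0, 0, 1, 0, 0)}
center = (0, 1, 1, 0, 0)

def J(r):
    ext = list(r) + [1 - v for v in r]
    return {j + 1 for j, v in enumerate(ext) if v == 0}

Jc = J(center)                                            # = J_5
i_of = {next(iter(Jc - J(A))): i for i, A in rows.items()}  # j -> i_j via J_5 \ J_i
Jstar = set.intersection(Jc, *map(J, rows.values()))      # = {5}

lemma1 = [f"y{i} <= x{next(iter(Jc - J(A)))}" for i, A in rows.items()]   # (6)
thm1 = [f"y1 + y2 + y3 + y4 <= 1 - x{j}" for j in sorted(Jstar)]          # (7)
cor1 = [f"sum(y_i : i != {i_of[j]}) <= 1 - x{j}" for j in sorted(Jc - Jstar)]  # (8)
assert len(lemma1) + len(thm1) + len(cor1) == 9           # replaces 20 McCormick cuts
```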

For analyzing the strength of the polyhedral inequalities above, we let

$$\begin{aligned} X {:=} \left\{ \left. x_j \in \{0, 1\}, j \in \mathscr {N}, \,\, y_i \in [0,1], i \in U \, \right| \, (3), \, (5), \, y_i \le 1 - x_j, j \in J_i, i \in U \right\} . \end{aligned}$$

Then, the following holds (and we defer its proof until the end of the next subsection).

Theorem 2

For a star \(\mathfrak {S}\) described in Theorem 1, (7) and (8) define facets of conv(X).

2.2 Stronger valid inequalities from a set of stars

Let us consider a set of \(q \, (\ge 2)\) (maximal) stars \(\mathfrak {S}_1, \ldots , \mathfrak {S}_q \) and let \(A_{\ell _k}\) denote the internal node of \(\mathfrak {S}_k\) and \(U_k\) denote the set of \(+\) observations (that is, the leaf nodes) of \(\mathfrak {S}_k\), \(k \in Q {:=} \{1, \ldots , q\}\). Further, we let

$$\begin{aligned} \varDelta _k {:=} \bigcup _{i \in U_k} \{ \left. j \in N \, \right| \, A_{ij} \not = A_{\ell _k j} \}, \;\; k \in Q \end{aligned}$$

for each star \(\mathfrak {S}_k\). That is, \(\varDelta _k\) collects the indices where individual leaf nodes of the star \(\mathfrak {S}_k\) are different from the center in n original attributes. We also recall that each \(A_{\ell _k}\) corresponds to a term in the constraint function \(\varphi ^-\) of (PG), which is 0–1 linearized by the standard method into a minimal cover inequality [17] as follows:

$$\begin{aligned} \sum _{j \in J_{\ell _k}} x_j \ge 1, \;\; k \in Q \end{aligned}$$
(9)

Finally, for a pair of stars, we let

$$\begin{aligned} \varDelta _{kl} {:=} \{ \left. j \in N \, \right| \, A_{\ell _k j} \not = A_{\ell _l j}\}, \;\; k, l \in Q \end{aligned}$$

to store information on attribute(s) where the two internal nodes of the stars differ (in the original attributes).

With the notation above, consider the following result.

Theorem 3

Consider a set Q of \(q \, (\ge 2)\) stars. For any \(k, l \in Q\) and \(j' \in \varDelta _k\) and \(j'' \in \varDelta _l\), suppose that \(i_{j'} \not = i_{j''}\) and that neither of the two conditions below holds:

i) \(j' \not = j''\) and \(j', j'' \in \varDelta _{kl}\)

ii) \(j' = j''\) and \(j', j'' \not \in \varDelta _{kl}\)

Then, for each \( j \in \mathfrak {J} {:=} \bigcap _{k \in Q} J_k^\star \, (\not = \emptyset )\), where \(J^\star _k {:=} \bigcap _{i \in U_k} J_i \cap J_{\ell _k}\), the inequality

$$\begin{aligned} \sum _{i \in \mathscr {U}} y_i \le 1 - x_j \end{aligned}$$
(10)

is valid for \(\mathscr {I}_{PG}\), where \( \mathscr {U} {:=} \bigcup _{k \in Q} U_k.\)

Proof

If \(x_j = 1\) for any \(j \in \mathfrak {J}\), then (9) is satisfied and \(y_i = 0, \forall i \in \mathscr {U}\). Therefore, \(\sum _{i \in \mathscr {U}} y_i = 0\), and (10) is trivially satisfied.

On the other hand, if \(x_j = 0, \forall j \in \mathfrak {J}\), we need to show that

$$\begin{aligned} \sum _{i \in U_k} y_i + \sum _{i \in U_l} y_i \le 1 \end{aligned}$$
(11)

holds for any pair \(k, l \in Q\) satisfying the conditions of the theorem. Equivalently, it suffices to show that, for every \(j' \in \varDelta _k\) and \(j'' \in \varDelta _l\) (\(i_{j'} \not = i_{j''}\)) satisfying neither of i) and ii), we have:

$$\begin{aligned} y_{i_{j'}} + y_{i_{j''}} \le 1 \end{aligned}$$
(12)

(To see why, note via (7) in Theorem 1 that \( \sum _{i \in U_k} y_i \le 1 \text { and } \sum _{i \in U_l} y_i \le 1\). Thus, \(y_{i_{j'}} = 1\), for instance, indicates that \( y_i = 0, \forall i \in U_k \setminus \{i_{j'}\}\) and, via (12), \(y_{i} = 0, \forall i \in U_l\).)

For a pair of stars under consideration, the minimal cover inequalities in (9) associated with \(A_{\ell _k}\) and \(A_ {\ell _l}\) also need to be satisfied by a 0–1 feasible solution to \(\mathscr {I}_{PG}\); that is, at least one x in them needs to be set to 1. Without loss of generality, assume \(A_{\ell _k j'} = 0\) and consider the two possible cases for \(j' \in \varDelta _k\) and \(j'' \in \varDelta _l\) with \(i_{j'} \not = i_{j''}\).

Case 1 (\(j' \not = j''\)). In this case we have \(j' \not \in \varDelta _{kl} \text { or } j'' \not \in \varDelta _{kl}\). Without loss of generality, assume \(j' \not \in \varDelta _{kl}\). This yields

$$\begin{aligned} A_{\ell _l j'} = 0, \,\,A_{i_{j'} j'} = 1 \text{ and } A_{i_{j''} j'} = 0. \end{aligned}$$

Thus, setting \(x_{j'} = 1\) satisfies (9) for \(\ell _k\) and \(\ell _l\) and yields

$$\begin{aligned} y_{i_{j'}} \le x_{j'} = 1, \,\, y_{i_{j''}} \le 1 - x_{j'} = 0, \end{aligned}$$

which satisfies (12).

Case 2 (\(j' = j''\)). In this case, we have \(j' (=j'') \in \varDelta _{kl}\) such that \(A_{\ell _l j'} = 1\), \(A_{i_{j'} j'} = 1\) and \(A_{i_{j''} j'} = 0\). Therefore, by setting \(x_{j'} = 1\), we have (9) satisfied for \(\ell _k\) and obtain

$$\begin{aligned} y_{i_{j'}} \le x_{j'} = 1, \,\, y_{i_{j''}} \le x_{n+j''} = x_{n+j'} = 0. \end{aligned}$$

(9) is also satisfied for \(\ell _l\); depending on which \(x_j\) is set to 1, we either have \(y_{i_{j'}} \le 1\) or \( y_{i_{j'}} \le 0\). Therefore, (12) is satisfied in this case as well.

Concluding, with (11) holding for each pair of stars satisfying the conditions of the theorem, at most one \(y_i, i\in \mathscr {U}\) can take value 1. Along with \(x_j = 0, \forall j \in \mathfrak {J}\), this shows that \(\sum _{i \in \mathscr {U}} y_i \le 1 = 1 - x_j, \, \forall j \in \mathfrak {J}\) and completes the proof. \(\square \)

Remark 5

Briefly, if either condition of Theorem 3 holds for some pair of stars, inequality (11) can fail for that pair, thereby making (10) invalid for \(\mathscr {I}_{PG}\). \(\square \)
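As an algorithmic aside, the pairwise conditions of Theorem 3 are straightforward to check mechanically. A minimal sketch (our own illustration; each star is given as a center row plus a leaf map, and all rows are hypothetical):

```python
# Our own mechanical check of conditions i) and ii) of Theorem 3 for a pair of
# stars; each star is (center row, {leaf index: leaf row}), all rows hypothetical.
def diffs(a, b):
    return {j + 1 for j, (u, v) in enumerate(zip(a, b)) if u != v}

def stars_compatible(c_k, leaves_k, c_l, leaves_l):
    D_kl = diffs(c_k, c_l)                                # Delta_kl
    Dk = {j: i for i, A in leaves_k.items() for j in diffs(A, c_k)}  # j -> i_j
    Dl = {j: i for i, A in leaves_l.items() for j in diffs(A, c_l)}
    for jp, ip in Dk.items():
        for jq, iq in Dl.items():
            if ip == iq:
                continue
            if jp != jq and jp in D_kl and jq in D_kl:    # condition i) holds
                return False
            if jp == jq and jp not in D_kl:               # condition ii) holds
                return False
    return True

assert stars_compatible((0, 0, 0), {1: (1, 0, 0)}, (1, 0, 1), {2: (1, 1, 1)})
assert not stars_compatible((0, 0, 0), {1: (1, 0, 0)}, (0, 0, 1), {2: (1, 0, 1)})
```

In the second call, the two leaves differ from their centers in the same position while the centers agree there, so condition ii) holds and the stars cannot be combined in (10).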

Now, we let

$$\begin{aligned} \Xi {:=} \left\{ \left. x_j \in \{0,1\}, j \in \mathscr {N}, y_i \in [0,1], i \in \mathscr {U} \, \right| \, (3), \, (9), \, y_i \le 1 - x_j, \, j \in J_i , \, i \in \mathscr {U} \right\} \end{aligned}$$

and analyze the strength of each of the new inequalities in (10) for \(j \in \mathfrak {J}\).

Theorem 4

For a set of stars in Theorem 3, (10) defines a facet of \(conv(\Xi )\).

Proof

The validity of (10) was established in the proof of Theorem 3.

Suppose now that (10) is not facet-defining. Then, there exists a facet-defining inequality of \(\Xi \) of the form

$$\begin{aligned} \sum _{j \in \mathscr {N}} \alpha _j x_j + \sum _{i \in \mathscr {U}} \beta _i y_i \le \gamma , \end{aligned}$$
(13)

where \((\varvec{\alpha }, \varvec{\beta }) \not = (\mathbf {0}, \mathbf {0})\), such that (10) defines a face of the facet of \(conv(\Xi )\) defined by (13). That is,

$$\begin{aligned} F_1{:=} \left\{ (\mathbf {x}, \mathbf {y}) \in \Xi \, \left| \, x_j + \sum _{i \in \mathscr {U}} y_i = 1 \right. \right\} \subseteq F {:=} \left\{ (\mathbf {x}, \mathbf {y}) \in \Xi \, \left| \, \sum _{j \in \mathscr {N}} \alpha _j x_j + \sum _{i \in \mathscr {U}} \beta _i y_i = \gamma \right. \right\} . \end{aligned}$$

Consider the following two cases for the solutions in \(F_1\).

Case 1 (\(x_j=1\)). Since \(j \in \mathfrak {J}\), this solution satisfies (9). The solution with \(x_{j'} = 0, \forall j' \in \mathscr {N} \setminus \{j\}\) and \(y_i = 0, \, \forall i \in \mathscr {U}\) belongs to \(F_1\), hence to F, which yields \(\alpha _j = \gamma \). Next, modify this solution by setting \(x_{j'} =1\) for some \(j' \in \mathscr {N} \setminus \{j, j^c\}\). This yields \( \alpha _j + \alpha _{j'} = \gamma \), thus:

$$\begin{aligned} \alpha _j = \gamma \; \text{ and } \; \alpha _{j'} = 0, \, \forall j' \in \mathscr {N} \setminus \{j, j^c\} \end{aligned}$$

Case 2 (\(x_j = 0\)). The existence of a pattern is guaranteed for a contradiction-free dataset. This implies that there always exists a 0–1 vector \(\mathbf {x}\) that yields \(y_i=1\) for each \(i \in \mathscr {U}\). Also, note that the value of \(x_{j^c}\) does not affect \(\mathbf {y}\); thus, \(\beta _i = \beta _i + \alpha _{j^c} = \gamma \) for all \(i \in \mathscr {U}\). These yield

$$\begin{aligned} \alpha _{j^c} = 0 \; \text{ and } \; \beta _i = \gamma , \, \forall i \in \mathscr {U}. \end{aligned}$$

Summarizing, the two cases above show that

$$\begin{aligned} \alpha _{j'} = 0, \, \forall j' \in \mathscr {N} \setminus \{j\}; \; \text{ and } \; \alpha _j = \beta _i = \gamma > 0, \, \forall i \in \mathscr {U}, \end{aligned}$$

where \(\gamma > 0\) is from our supposition that (10) is dominated by (13). This indicates that (13) is a positive multiple of (10) (which is a contradiction) and completes the proof.\(\square \)

For a set of stars \(\mathfrak {S}_k, k \in Q\), let

$$\begin{aligned} \varDelta _\otimes {:=} \bigcup _{k \in Q} \varDelta _k \;\; \text{ and } \;\; \varDelta _\odot {:=} \bigcup _{k, l \in Q, \, k \not = l} \varDelta _{kl} \end{aligned}$$

and consider the following result.

Theorem 5

Consider a set of stars in Theorem 3. For each \(\iota \in \varDelta _\otimes \setminus \varDelta _\odot \), assume without loss of generality that \(\iota \in \varDelta _k\), \(k \in Q\), and let \(i_\iota \in \mathscr {U}\) be such that \(J_ {\ell _k} \setminus J_{i_\iota } = \{j\}\). Then, the inequality

$$\begin{aligned} \sum _{i \in \mathscr {U} \setminus \{ i_\iota \} } y_i \le 1 - x_j \end{aligned}$$
(14)

is valid for \(\Xi \) and defines a facet of \(conv(\Xi )\).

Proof

The validity of (14) for \(\Xi \) is immediate from Theorem 3 for the sub-graph with \(A_{i_\iota }\) removed from the set of stars for the theorem. As for the facet result, suppose to the contrary that (14) is not facet-defining for \(conv(\Xi )\). Then

$$\begin{aligned} F_2 {:=} \left\{ (\mathbf {x}, \mathbf {y}) \in \Xi \, \left| \, x_j + \sum _{i \in \mathscr {U} \setminus \{i_\iota \}} y_i = 1 \right. \right\} \subseteq F, \end{aligned}$$

where F is given by a facet-defining inequality of \(conv(\Xi )\) of the form (13), as in the proof of Theorem 4. Now, consider the following two cases for any solution in \(F_2\).

Case 1 (\(x_j=1\)). Note that \(j \in J^\cap {:=} \cap _{k \in Q} J_{\ell _k}\), thus this solution satisfies (9). The solution with \(x_{j'} = 0, \, \forall j' \in \mathscr {N} \setminus \{j\}\) and \(y_i = 0, \forall i \in \mathscr {U}\) belongs to \(F_2\), hence F. This yields \(\alpha _j = \gamma \). For this solution, we can set \(x_{j'}=1\) for each \(j' \in \mathscr {N} \setminus \{j, j^c\}\) to obtain \(\alpha _j + \alpha _{j'} = \gamma \). We thus have

$$\begin{aligned} \alpha _j = \gamma \; \text{ and } \; \alpha _{j'} = 0, \, \forall j' \in \mathscr {N} \setminus \{j, j^c\}. \end{aligned}$$

Moreover, as noted in the proof of Theorem 4, the existence of a pattern for a contradiction-free dataset guarantees the existence of \(\mathbf {x}\) such that \(y_{i_\iota } = 1\). This yields:

$$\begin{aligned} \alpha _j + \beta _{i_\iota } = \gamma \quad \Longrightarrow \quad \beta _{i_\iota } = 0 \end{aligned}$$

Case 2 (\(x_j=0\)). Again, there exists a 0–1 vector \(\mathbf {x}\) that yields \(y_i=1\) for each \(i \in \mathscr {U} \setminus \{i_\iota \}\). If \(x_{j^c} = 0\) (or \(x_{j^c} = 1\)) in such a solution, we have \(\beta _i = \gamma \) (or \(\alpha _{j^c} + \beta _i = \gamma \)) for \(i \in \mathscr {U} \setminus \{i_\iota \}\), hence obtain:

$$\begin{aligned} \alpha _{j^c} = 0 \; \text{ and } \; \beta _i = \gamma , \, \forall i \in \mathscr {U} \setminus \{i_\iota \} \end{aligned}$$

By the two cases above, we have

$$\begin{aligned} \alpha _{j'} = 0, \, \forall j' \in \mathscr {N} \setminus \{j\}; \; \beta _{i_\iota } = 0; \; \text{ and } \; \alpha _j = \beta _i = \gamma > 0, \, \forall i \in \mathscr {U} \setminus \{i_\iota \}, \end{aligned}$$

where \(\gamma > 0\) is from the supposition that (14) is dominated by (13). This shows that (13) is a positive multiple of (14) and completes the proof. \(\square \)

We have another interesting result below.

Theorem 6

Consider a set of stars in Theorem 3. For each \(\iota \in \varDelta _\odot \), let:

$$\begin{aligned} \mathscr {U}_0^\iota := \{ i \in \mathscr {U} | A_{i \iota } = 0 \} \;\; \text{ and } \;\; \mathscr {U}_1^\iota := \{ i \in \mathscr {U} | A_{i \iota } = 1 \} \end{aligned}$$

Then, a pair of inequalities

$$\begin{aligned} \sum _{i \in \mathscr {U}_0^\iota } y_i \le 1 - x_\iota \;\; \text{ and } \;\; \sum _{i \in \mathscr {U}_1^\iota } y_i \le 1 - x_{\iota + n} \end{aligned}$$
(15)

are valid for \(\Xi \) and define facets of \(conv(\Xi )\).

Proof

Again, the validity result is immediate; apply Theorem 3 to the two sub-graphs, one with \(A_i, i \in \mathscr {U}_0^\iota \) and the other with \(A_i, i \in \mathscr {U}_1^\iota \), respectively. As for the facet result, since the inequalities in (15) are symmetric, it suffices to show it for the former. To this end, we again suppose the contrary to let

$$\begin{aligned} F_3 {:=} \left\{ (\mathbf {x}, \mathbf {y}) \in \Xi \, \left| \, x_\iota + \sum _{i \in \mathscr {U}_0^\iota } y_i = 1 \right. \right\} \subseteq F \end{aligned}$$

(where F is as defined in Theorem 3) and investigate the two possible cases for the solutions of \(F_3\).

Case 1 (\(x_\iota = 1\)). The solution with \(x_j = 1\) for some \(j \in \mathfrak {J}\), \(x_{j'} = 0, \forall j' \in \mathscr {N} \setminus \{\iota , j\}\) and \(y_i = 0, \forall i \in \mathscr {U}\) satisfies (9), hence belongs to \(F_3\) and F. This yields \(\alpha _j + \alpha _\iota = \gamma , \, j \in \mathfrak {J}\). Let \(J {:=} J^\cap \setminus \mathfrak {J}\) (where \( J^\cap = \cap _{k \in Q} J_{\ell _k}\)) and note that the solution with \(x_{j'}=1\) for some \(j' \in J\), \(x_{j''} = 0, \forall j'' \in \mathscr {N} \setminus \{j', \iota \}\) and \(y_i = 0, \forall i \in \mathscr {U}\) also satisfies (9), hence belongs to \(F_3\) and F. This yields \(\alpha _{j'} + \alpha _\iota = \gamma , \, j' \in J\). Therefore, the combination of these two solutions yields \(\alpha _j + \alpha _{j'} + \alpha _\iota = \gamma , \, j \in \mathfrak {J}, j' \in J\); hence \(\alpha _\iota = \gamma , \, \alpha _j = 0, j \in J^\cap \). By setting \(x_{j''} = 1\) for each \(j'' \in \mathscr {N} \setminus (J^\cap \cup \{\iota ^c\})\) in the first solution above, we obtain:

$$\begin{aligned} \alpha _\iota + \alpha _{j''} = \gamma \quad \Longrightarrow \quad \alpha _\iota = \gamma , \; \text{ and } \; \alpha _j = 0, \forall j \in \mathscr {N} \setminus \{\iota , \iota ^c\} \end{aligned}$$

Note that \(x_\iota = 1\) implies \(y_i = 0, \forall i \in \mathscr {U}_0^\iota \). Additionally, there is a 0–1 vector \(\mathbf {x}\) that yields \(y_i = 1\) for each \(i \in \mathscr {U}_1^\iota \). Therefore, we have:

$$\begin{aligned} \alpha _\iota + \beta _i = \gamma , \, i \in \mathscr {U}_1^\iota \quad \Longrightarrow \quad \beta _i = 0, \forall i \in \mathscr {U}_1^\iota \end{aligned}$$

Case 2 (\(x_\iota = 0\)). Again, for a 0–1 vector \(\mathbf {x}\) that yields \(y_i = 1\) for each \(i \in \mathscr {U}_0^\iota \), if \(x_{\iota ^c} = 0\) (or \(x_{\iota ^c} = 1\)), we have \(\beta _i = \gamma \) (or \(\alpha _{\iota ^c} + \beta _i = \gamma \)) for \(i \in \mathscr {U}_0^\iota \). Therefore,

$$\begin{aligned} \alpha _{\iota ^c} = 0 \; \text{ and } \; \beta _i = \gamma , \forall i \in \mathscr {U}_0^\iota . \end{aligned}$$

The two cases above yield

$$\begin{aligned} \alpha _j = 0, \forall j \in \mathscr {N} \setminus \{\iota \}; \; \beta _i = 0, \forall i \in \mathscr {U}_1^\iota ; \; \text{ and } \; \alpha _\iota = \beta _i = \gamma > 0, \forall i \in \mathscr {U}_0^\iota \end{aligned}$$

where \(\gamma > 0\) is from our supposition that (15) is dominated by (13). This shows that (13) is a positive multiple of (15) and completes the proof. \(\square \)

Remark 6

Denote by \(d_k\) the degree of (the internal node of) \(\mathfrak {S}_k\) for \(k \in Q\), so that Lemma 1 contributes \(\sum _{k \in Q} d_k\) edge inequalities. Then, the number of inequalities generated from a set of stars by Theorem 3 is equal to \(|\mathfrak {J}| = n - |\varDelta _\otimes \cup \varDelta _\odot |\). Likewise, the numbers of inequalities generated via Theorems 5 and 6 are calculated to be \(|\varDelta _\otimes \setminus \varDelta _\odot |\) and \(2|\varDelta _\odot |\), respectively. Therefore, the total number of inequalities from a set of stars is:

$$\begin{aligned} \sum _{k \in Q} d_k + n - \left| \varDelta _\otimes \cup \varDelta _\odot \right| + \left| \varDelta _\otimes \setminus \varDelta _\odot \right| + 2 \left| \varDelta _\odot \right| = \sum _{k \in Q} d_k + n + \left| \varDelta _\odot \right| \end{aligned}$$

\(\square \)
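The count above can be verified mechanically. The following Python sketch is an illustration we add here, using the values \(n = 10\), \(\sum _{k \in Q} d_k = 7\), \(\varDelta _\otimes = \{1,\ldots ,7\}\) and \(\varDelta _\odot = \{5,\ldots ,9\}\) of the illustrative example in Sect. 3; the simplification rests on the identity \(|\varDelta _\otimes \cup \varDelta _\odot | = |\varDelta _\otimes \setminus \varDelta _\odot | + |\varDelta _\odot |\):

```python
# Verify the inequality count of Remark 6 on the illustrative example
# of Sect. 3: n = 10, four stars with 7 edges in total,
# Delta_otimes = {1,...,7} and Delta_odot = {5,...,9}.
n = 10
sum_d_k = 7                        # total degree of the 4 stars (Lemma 1 edges)
delta_otimes = set(range(1, 8))    # union of the Delta_k
delta_odot = set(range(5, 10))     # union of the Delta_kl

thm3 = n - len(delta_otimes | delta_odot)   # inequalities via Theorem 3
thm5 = len(delta_otimes - delta_odot)       # inequalities via Theorem 5
thm6 = 2 * len(delta_odot)                  # inequalities via Theorem 6

total = sum_d_k + thm3 + thm5 + thm6
# The simplification uses |A ∪ B| = |A \ B| + |B|:
assert total == sum_d_k + n + len(delta_odot)
print(total)  # 22
```

For the example of Sect. 3, this reproduces the 22 inequalities from a set of stars reported in Table 2.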

Remark 7

As hinted in Remark 2 with a star inequality, the main results of this paper can be understood as providing sufficient conditions and mechanisms for 'algebraically lifting and coefficient-tightening' a maximal set of McCormick inequalities to the fullest extent. By fullest, we mean two things. One, the number of y variables in each new inequality is maximized. The other, the coefficient of each and every decision variable in \(\mathbf {x} \in \{0,1\}^{2n}\) and \(\mathbf {y} \in [0,1]^{m^+}\) is kept at 1, which is what yields the facet-defining results in Theorems 2, 4, 5, and 6. Furthermore, as should be clear by now, the maximum benefit is realized when a maximal set of maximal stars is exploited for generating the inequalities of this subsection (refer to Remark 4).

As a summary, the dominance relation among the inequalities dealt with and developed in this paper is captured in the hierarchy of their relative strength in Fig. 4. \(\square \)

Fig. 4

Dominance relation among valid inequalities for \(\varphi ^+\) of (PG). '\(\longrightarrow \)' indicates dominance of the head over the tail, while the dotted bidirectional arrow indicates that the two are equivalent; specifically, the one placed northeast reduces to the other under the condition provided in the legend

Last, we note that Theorem 1 and Corollary 1 are special cases of Theorems 3 and 5, respectively, when Q is a singleton. This yields the following proof for Theorem 2 as an immediate consequence of the results in Theorems 4 and 5.

Proof

(of Theorem 2). Since Q is a singleton, we have \(\varDelta _\otimes = \varDelta \), \(\varDelta _\odot = \emptyset \) and \(U = \mathscr {U}\). Use these in Theorem 4 for (7) and Theorem 5 for (8), respectively. \(\square \)

3 Illustrative example: construction and strength/utility of main results

Table 1 Illustrative (PG) Dataset

Consider the set of 6 \(+\) observations and 4 − observations in Table 1, described by 10 0–1 attributes; that is, \(n = 10\), thus \(N = \{1, \ldots , 10\}\), \(N^\prime = \{11, \ldots , 20\}\) and \(\mathscr {N} = \{1, \ldots , 20\}\). For convenience in presentation, we let \(S^+ = \{1, \ldots , 6\}\) and \(S^- = \{ 7, \ldots , 10\}\).

When analyzed on a graph, these 10 observations comprise 4 stars, namely, \(\mathfrak {S}_1\), \(\mathfrak {S}_2\), \(\mathfrak {S}_3\) and \(\mathfrak {S}_4\), as shown in Fig. 5 and supplemented by Table 1. (We note with \(A_5\) as an example that a \(+\) datum can belong to multiple stars.)

For easier referencing, we connect each pair of the internal nodes of the 4 stars by a dashed line and provide \(\varDelta _{kl}\) information (that is, the indices where the two internal nodes of the stars differ) along the dashed edge. For each star \(\mathfrak {S}_k, k \in Q{:=} \{1, 2, 3, 4\}\), the number on an edge between a leaf node and the center is the index where the pair of \(+\) and − observations differ. Therefore, \(\varDelta _k\) in Table 1 for each of the four stars is obtained by simply collecting the indices indicated along the edges of the star.
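The construction of the stars themselves is mechanical: the internal node of a star is a − observation, and its leaves are the \(+\) observations at Hamming distance 1 from it, with the differing attribute index serving as the edge label. A minimal Python sketch on a small hypothetical 0–1 dataset (not the dataset of Table 1) illustrates this:

```python
# Build the stars of the neighborhood graph: the internal node of a
# star is a '-' observation, and its leaves are the '+' observations
# differing from it in exactly one attribute (the edge label).
# Hypothetical toy data with n = 3; NOT the dataset of Table 1.
plus_data = {1: (1, 0, 1), 2: (0, 0, 1), 3: (1, 1, 1)}
minus_data = {4: (1, 0, 0), 5: (0, 1, 1)}

def stars(plus, minus):
    result = {}
    for ell, center in minus.items():
        edges = {}  # leaf index -> 1-based attribute index where they differ
        for i, row in plus.items():
            diff = [j for j in range(len(row)) if row[j] != center[j]]
            if len(diff) == 1:
                edges[i] = diff[0] + 1
        if edges:
            result[ell] = edges
    return result

print(stars(plus_data, minus_data))  # {4: {1: 3}, 5: {2: 2, 3: 1}}
```

For the toy data, the − observation indexed 4 centers a star whose only leaf is \(+\) observation 1 (edge label 3), while observation 5 centers a star with leaves 2 and 3.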

To illustrate how the inequalities of this paper are generated for this dataset, we first take the 7 edges of the graph and use (6) of Lemma 1 to obtain 7 valid inequalities

$$\begin{aligned}&y_1 \le x_1, \;\; y_2 \le x_2, \;\; y_3 \le x_{17}, \;\; y_4 \le x_3, \;\; \\&y_5 \le x_{15}, \;\; y_5 \le x_{16}, \;\; y_6 \le x_4 \end{aligned}$$

which, respectively, dominate

$$\begin{aligned}&y_1 \le 1 - x_{11}, \;\; y_2 \le 1 - x_{12}, \;\; y_3 \le 1 - x_7, \;\; y_4 \le 1 - x_{13}, \;\; \\&y_5 \le 1 - x_5, \;\; y_5 \le 1 - x_6, \;\; y_6 \le 1 - x_{14}. \end{aligned}$$

The reader can verify that the inequalities generated by Lemma 1 are not dominated by any other inequality, new or old.

Fig. 5

Graph analysis of data in Table 1

To illustrate how to generate a set of inequalities from individual stars, we consider \(\mathfrak {S}_4\), comprised of the internal node \(A_{10}\) and three leaf nodes \(A_3\), \(A_4\) and \(A_5\); hence \(U_4 = \{3, 4, 5\}\). First, we collect the information on the edges of the graph to obtain \(\varDelta _4 = \{3, 5, 7\} \, (= \bigcup _{i \in U_4} \{ \left. j \in N \, \right| \, A_{ij} \not = A_{10, j} \})\) and set \(J^\star _4 = J_3 \cap J_4 \cap J_5 \cap J_{10} = \{1,2,4,6,9,10,18\}\) (refer to Table 1 for information on \(J_i\), \(i \in U_4\)). Thus, (7) of Theorem 1 yields that

$$\begin{aligned} y_3 + y_4 + y_5 \le 1 - x_j \end{aligned}$$

dominates

$$\begin{aligned} y_i \le 1 - x_j, \;\; \forall i \in U_4 = \{3, 4, 5\} \end{aligned}$$

for every \(j \in J^\star _4 = \{1,2,4,6,9,10,18\}\).
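The index sets used above follow directly from the definitions in Sect. 1: \(J_i\) collects the \(j \in \mathscr {N}\) with \(A_{ij} = 0\) (recall \(A_{i,n+j} = 1 - A_{ij}\), so \(|J_i| = n\)), and \(J^\star _k\) is the intersection of the \(J_i\)'s of a star's center and leaves. A hedged Python sketch, again on hypothetical rows rather than those of Table 1:

```python
# J_i = { j in script-N | A_ij = 0 } over the 2n extended attributes
# (a_{n+j} = NOT a_j, so |J_i| = n), and J*_k as the intersection of
# the J_i's of a star's center and leaves.
# Hypothetical rows with n = 4; NOT the rows of Table 1.
def J(row):
    n = len(row)
    ext = list(row) + [1 - v for v in row]   # attributes, then negations
    return {j + 1 for j in range(2 * n) if ext[j] == 0}

center = (1, 0, 0, 1)                    # a hypothetical '-' observation
leaves = [(1, 1, 0, 1), (1, 0, 1, 1)]    # '+' observations at distance 1

J_star = J(center)
for row in leaves:
    J_star &= J(row)

# Each j in J_star yields: (sum of y_i over the leaves) <= 1 - x_j
print(sorted(J_star))  # [5, 8]
```

Each \(j\) in the computed intersection indexes one aggregated inequality of the form (7) for this hypothetical star.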

Now, applying Corollary 1 with \(\varDelta _4 = \{3,5,7\}\), we obtain additional inequalities involving \(x_j\), \(j \not \in J^\star _4\). For example, for \(\iota = 3 \in \varDelta _4\), we have \(i_\iota = 4\), as indicated by the label \(\iota = 3\) along the edge between \(A_4\) and \(A_{10}\) in Fig. 5. Using the information on \(J_4\) in Table 1, one can verify that \(J_{10} \setminus J_4 = \{3\}\). Now, using this in (8) of Corollary 1, we obtain

$$\begin{aligned} y_3 + y_5 \le 1 - x_3 \end{aligned}$$

that is stronger than any McCormick inequality that involves \(x_3\), \(y_3\) and/or \(y_5\).

Moving on to more complex results, the reader can confirm that the 4 stars in Fig. 5 satisfy the conditions of Theorem 3. So, we have \(\mathscr {U} = U_1 \cup U_2 \cup U_3 \cup U_4 = \{1,2,3,4,5,6\}\) and \(\mathfrak {J} = J^\star _1 \cap J^\star _2 \cap J^\star _3 \cap J^\star _4 = \{10\}\), and use (10) of Theorem 3 to generate

$$\begin{aligned} y_1 + y_2 + y_3 + y_4 + y_5 + y_6 \le 1 - x_{10} \end{aligned}$$

which not only dominates its McCormick counterparts

$$\begin{aligned} y_i \le 1 - x_{10}, \;\; \forall i \in \{1,2,3,4,5,6\} \end{aligned}$$

but also the inequality from \(\mathfrak {S}_4\) for \(j = 10 \in J^\star _4\) via (7):

$$\begin{aligned} y_3 + y_4 + y_5 \le 1 - x_{10} \end{aligned}$$

The above serves as an example showing that an inequality from a more complex neighborhood structure (hence, a larger neighborhood) is stronger than the one(s) obtained from its sub-structure(s), with exceptions dealt with in Corollary 1 and Theorems 5 and 6 (refer to Fig. 4).
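This dominance can also be confirmed numerically: since \(y_i \ge 0\) for all i, any point satisfying the aggregated inequality \(\sum _i y_i \le 1 - x_{10}\) satisfies every single-term counterpart \(y_i \le 1 - x_{10}\). A brute-force sweep over a grid of candidate points (the grid resolution is an arbitrary choice for illustration):

```python
# Check on a grid of candidate points that the aggregated inequality
#   y_1 + ... + y_m <= 1 - x   (with every y_i >= 0)
# implies each McCormick counterpart y_i <= 1 - x.
import itertools

def strong_implies_weak(grid, m):
    for x in (0, 1):
        for y in itertools.product(grid, repeat=m):
            if sum(y) <= 1 - x and not all(yi <= 1 - x for yi in y):
                return False
    return True

# m = 6 terms as in the example; each y_i swept over {0, 0.25, ..., 1}
print(strong_implies_weak([k / 4 for k in range(5)], 6))  # True
```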

To obtain a valid inequality involving a different \(x_j\), \(j \not \in \mathfrak {J}\), we use Theorem 5 with \(\varDelta _\otimes = \bigcup _{k \in \{1, \ldots , 4\}} \varDelta _k = \{1,2,3,4,5,6,7\}\) (refer to Table 1 for the \(\varDelta _k\) information) and \(\varDelta _\odot = \bigcup _{k, l \in \{1, \ldots , 4\}, \, k \not = l} \varDelta _{kl} = \{5,6,7,8,9\}\) (refer to Fig. 5 for the \(\varDelta _{kl}\) information along the dashed lines). For example, for \(\iota = 1 \in \varDelta _\otimes \setminus \varDelta _\odot = \{1,2,3,4\}\), we have \(i_\iota = 1\), where \(A_1\) is a leaf node of \(\mathfrak {S}_1\) with the internal node \(A_7\). Thus, we obtain \(\mathscr {U} \setminus \{i_\iota \} = \{2,3,4,5,6\}\) and, from the center node \(A_7\) and the leaf node \(A_1\), \(J_7 \setminus J_1 = \{1\}\). Using these in (14) of Theorem 5, we obtain

$$\begin{aligned} y_2 + y_3 + y_4 + y_5 + y_6 \le 1 - x_1 \end{aligned}$$

that dominates

$$\begin{aligned} y_i \le 1 - x_1, \;\; \forall i \in \{2,3,4,5,6\} \end{aligned}$$

and, also, the following star inequalities from \(\mathfrak {S}_3\), \(\mathfrak {S}_4\), and \(\mathfrak {S}_2\) respectively:

$$\begin{aligned} y_2 \le 1 - x_1, \;\; y_3 + y_4 + y_5 \le 1 - x_1, \;\; y_5 + y_6 \le 1 - x_1 \end{aligned}$$

Similarly, with \(\iota \in \{2,3,4\}\), we obtain

$$\begin{aligned} \begin{array}{cccccccccccl} y_1 &{} &{} &{}+ &{}y_3 &{}+ &{}y_4 &{}+ &{}y_5 &{}+ &{}y_6 &{} \; \le 1 - x_2, \\ y_1 &{}+ &{}y_2 &{}+ &{}y_3 &{} &{} &{}+ &{}y_5 &{}+ &{}y_6 &{} \; \le 1 - x_3, \\ y_1 &{}+ &{}y_2 &{}+ &{}y_3 &{}+ &{}y_4 &{}+ &{}y_5 &{} &{} &{} \; \le 1 - x_4, \end{array} \end{aligned}$$

that yield a tighter polyhedral relaxation than the one given by the set of their McCormick counterparts

$$\begin{aligned}&y_1 \le 1 - x_2, \;\; y_3 \le 1 - x_2, \;\; y_4 \le 1 - x_2, \;\; y_5 \le 1 - x_2, \;\; y_6 \le 1 - x_2, \\&y_1 \le 1 - x_3, \;\; y_2 \le 1 - x_3, \;\; y_3 \le 1 - x_3, \;\; y_5 \le 1 - x_3, \;\; y_6 \le 1 - x_3, \\&y_1 \le 1 - x_4, \;\; y_2 \le 1 - x_4, \;\; y_3 \le 1 - x_4, \;\; y_4 \le 1 - x_4, \;\; y_5 \le 1 - x_4, \end{aligned}$$

and a set of star inequalities below:

$$\begin{aligned} \begin{array}{rrrr} y_1 \le 1 - x_2, &{} \quad y_3 + y_4 + y_5 \le 1 - x_2, &{} \quad y_5 + y_6 \le 1 - x_2, &{} \\ y_1 \le 1 - x_3, &{} \quad y_2 \le 1 - x_3, &{} \quad y_3 + y_5 \le 1 - x_3, &{} \quad y_5 + y_6 \le 1 - x_3, \\ y_1 \le 1 - x_4, &{} \quad y_2 \le 1 - x_4, &{} \quad y_3 + y_4 + y_5 \le 1 - x_4, &{} \quad y_5 \le 1 - x_4. \end{array} \end{aligned}$$

Last, for Theorem 6, consider \(\iota = 5 \in \varDelta _\odot = \{5,6,7,8,9\}\) to obtain \(\mathscr {U}_0^5 = \{1,2,5,6\}\) and \(\mathscr {U}_1^5 = \{3,4\}\). These give rise to two inequalities

$$\begin{aligned} y_1 + y_2 + y_5 + y_6 \le 1 - x_5 \;\; \text { and } \;\; y_3 + y_4 \le 1 - x_{15} \end{aligned}$$

that are stronger than the 6 McCormick inequalities involving \(x_5\) and \(x_{15}\) and also 3 star inequalities.
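The partition in Theorem 6 is read directly off the \(\iota \)-th attribute column of the \(+\) data. The sketch below reconstructs \(\mathscr {U}_0^5\) and \(\mathscr {U}_1^5\) from the column values implied by the example (that is, \(A_{i,5} = 1\) exactly for \(i \in \{3, 4\}\); the dictionary encoding is our own illustration):

```python
# Theorem 6: partition U by the value of attribute iota,
#   U_0 = { i | A_{i,iota} = 0 }  ->  sum_{i in U_0} y_i <= 1 - x_iota
#   U_1 = { i | A_{i,iota} = 1 }  ->  sum_{i in U_1} y_i <= 1 - x_{iota+n}
def split_by_attribute(column):
    U0 = {i for i, v in column.items() if v == 0}
    U1 = {i for i, v in column.items() if v == 1}
    return U0, U1

# Column A_{i,5} for i in U = {1,...,6}, as implied by the example:
col5 = {1: 0, 2: 0, 3: 1, 4: 1, 5: 0, 6: 0}
U0, U1 = split_by_attribute(col5)
print(U0, U1)
```

This recovers \(\mathscr {U}_0^5 = \{1,2,5,6\}\) and \(\mathscr {U}_1^5 = \{3,4\}\), hence the pair of inequalities above.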

Similarly, Theorem 6 yields 4 additional pairs (a total of 8) of valid inequalities below for the illustrative dataset under analysis (note that 3 of these reduce to McCormick inequalities; see Fig. 4 for a quick reference on when this happens):

$$\begin{aligned} \begin{array}{cccccccccccl} y_1 &{}+ &{}y_2 &{}+ &{}y_3 &{}+ &{}y_4 &{}+ &{}y_5 &{} &{} &{} \; \le 1 - x_6 \\ &{} &{} &{} &{} &{} &{} &{} &{} &{} &{}y_6 &{} \; \le 1 - x_{16} \\ y_1 &{}+ &{}y_2 &{}+ &{}y_3 &{} &{} &{} &{} &{} &{} &{} \; \le 1 - x_7 \\ &{} &{} &{} &{} &{} &{}y_4 &{}+ &{}y_5 &{}+ &{}y_6 &{} \; \le 1 - x_{17} \\ y_1 &{} &{} &{} &{} &{} &{} &{} &{} &{} &{} &{} \; \le 1 - x_8 \\ &{} &{}y_2 &{}+ &{}y_3 &{}+ &{}y_4 &{}+ &{}y_5 &{}+ &{}y_6 &{} \; \le 1 - x_{18} \\ y_1 &{} &{} &{}+ &{}y_3 &{}+ &{}y_4 &{}+ &{}y_5 &{}+ &{}y_6 &{} \; \le 1 - x_9 \\ &{} &{}y_2 &{} &{} &{} &{} &{} &{} &{} &{} &{} \; \le 1 - x_{19} \end{array} \end{aligned}$$
Table 2 Summary of valid inequalities for (PG) instance on dataset in Table 1 and illustration of benefits of new results

As a summary, we list all 60 standard McCormick inequalities, the 45 inequalities from individual stars via Lemma 1, Theorem 1 and Corollary 1 in Sect. 2.1, and the 22 inequalities from a set of stars via Lemma 1, Theorems 3, 5 and 6 in Sect. 2.2 in the three columns of Table 2, going from left to right, respectively. For easy comparison, we use '\(\longrightarrow \)' in this table for the dominance relation, to state that the inequality on its tail side is dominated by the one at its head. In brief, this table pictorially depicts the benefits of the results of this paper in regard to:

  • algebraic lifting and strengthening of simple, McCormick envelopes for a 0–1 multilinear objective function \(\varphi ^+\) of (PG);

  • a multi-term relaxation of \(\varphi ^+\) (for the example considered, the simultaneous relaxation of all six terms of \(\varphi ^+\)) and strengthening; and

  • a tighter polyhedral relaxation of (PG) in terms of a much smaller number of stronger 0–1 linear inequalities.

Furthermore, the reader can confirm in this table the hierarchy of relative strength of the inequalities in Fig. 4: the more advanced an inequality, the stronger it is. Thus, the tightest polyhedral overestimation of \(\varphi ^+\) of (PG) is obtained when a maximum number of maximal stars are exploited for the purpose.

Last, recall (PG) to note that, for \(+\) pattern generation experiments, we need to convexify \(\varphi ^-\) and concavify \(\varphi ^+\). For a numerical demonstration of the utility of the new polyhedral overestimation scheme for \(\varphi ^+\), we thus obtain 0–1 polyhedral relaxations of the (PG) instance generated on the dataset in Table 1 as

$$\begin{aligned} (\underline{\text{ PG }})_\dag : \;\; \max \; \left\{ \left. \displaystyle {\sum _{i = 1}^6 y_i} \; \right| \; \bar{\varphi }_\dag ^+, \; \sum _{j \in J_i} x_j \ge 1 \text{ for } i = 7, \ldots , 10, \; \mathbf x \in \{0,1\}^{20},\,\, \mathbf {y} \in [0,1]^6 \right\} \end{aligned}$$

where \(\bar{\varphi }_\dag ^+\) is the polyhedral concave overestimation of \(\varphi ^+\) obtained by means of the method specified in \(\dag \in \{\texttt {mccormick},\texttt {star},\texttt {stars}\}\): mccormick stands for the standard McCormick relaxation method, while star and stars indicate the improvements made on the McCormick-based model by means of the inequalities from individual stars in Sect. 2.1 and via those from the neighboring stars in Sect. 2.2, respectively. Next, \((\underline{\text{ PG }})_{\texttt {mccormick}}\), \((\underline{\text{ PG }})_{\texttt {star}}\) and \((\underline{\text{ PG }})_{\texttt {stars}}\) were solved by CPLEX 12.8 [19] without utilizing the cutting plane methods of the solver, and information on some important performance indicators is summarized in Table 3. Here, 'Root Relaxation Gap' is the % difference between the root LP relaxation value of the corresponding model and its optimum, 'CPU Ticks' is the solution time in ticks, and 'BB Nodes' is the total number of branch-and-bound nodes required by CPLEX for solving these MIP instances.

In brief, the results in Table 3 show the practical side of the mathematically stronger results in this paper well. Particularly, we note that \((\underline{\text{ PG }})_{\texttt {stars}}\) is integral, thus solved at the root node.

Table 3 CPLEX efforts for solving 0–1 equivalents of (PG) for \(+\) pattern generation for data in Table 1

4 Numerical experiments

For general \(\bullet \) pattern generation for \(\bullet \in \{+, -\}\), we can use \(\bar{\bullet }\) to denote the complementary type of \(\bullet \) with respect to \(\{+, -\}\) to obtain a 0–1 linear equivalent of (PG)\(^\bullet \) as

$$\begin{aligned} (\underline{\text{ PG }})^\bullet _\dag : \;\; \max \; \left\{ \left. \displaystyle {\sum _{i \in S^\bullet } y_i} \; \right| \; \bar{\varphi }_\dag ^\bullet , \; \sum _{j \in J_i} x_j \ge 1, \, \forall i \in S^{\bar{\bullet }}, \; \mathbf x \in \{0,1\}^{2n},\,\, \mathbf y \in [0, 1]^{m^\bullet } \right\} \end{aligned}$$

where \(\bar{\varphi }_\dag ^\bullet \) is the polyhedral concave overestimation of \(\varphi ^\bullet \) obtained by either method \(\dag \in \{\texttt {mccormick}, \texttt {stars}\}\) and \(m^\bullet \) is the number of \(\bullet \) data in the pattern generation task. Recall that \((\underline{\text{ PG }})^\bullet _{\texttt {stars}}\) is a strengthened form of \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) by means of the results in Sect. 2. It is both tighter and smaller than its predecessor and is thus expected to provide better LP relaxation bounds and aid in more efficient LAD pattern generation; that is, a more effective and efficient solution of the 0–1 MP in (PG).

To test this, this section runs pattern generation experiments with 12 public data mining datasets from [24], summarized in Table 4. More specifically, we took each dataset in this table and first randomly split it into 3 equal-sized, mutually disjoint partitions, each containing 1/3 of the \(+\) and 1/3 of the − data. Next, we combined 2 of them at a time to obtain 3 sub-datasets, each comprised of exactly 2/3 of the original dataset and differing from the others in 50% of its data, for generating a pair of \(+\) and − patterns via \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) and \((\underline{\text{ PG }})^\bullet _{\texttt {stars}}\). To assess the efficacy of the new relaxation method more objectively, we used CPLEX without utilizing CPLEX cuts when solving these two 0–1 MIP's.
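The splitting procedure above can be sketched as follows; in the experiments the split is applied to the \(+\) and − data separately, while the sketch, for brevity, shows the mechanics on a single index set (the seed and list encoding are illustrative assumptions):

```python
# Split an index set into 3 equal folds, then combine each pair of
# folds into a sub-dataset holding 2/3 of the data; any two
# sub-datasets then share exactly one fold, i.e., 50% of their data.
import itertools
import random

def three_fold_subsets(indices, seed=0):
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    k = len(idx) // 3
    folds = [idx[:k], idx[k:2 * k], idx[2 * k:3 * k]]
    return [folds[a] + folds[b] for a, b in itertools.combinations(range(3), 2)]

subs = three_fold_subsets(range(30))
assert all(len(s) == 20 for s in subs)   # each sub-dataset: 2/3 of the data
assert all(len(set(s) & set(t)) == 10    # pairwise overlap: 50%
           for s, t in itertools.combinations(subs, 2))
```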

When it comes to solving MIP's, cutting plane methods are a hallmark of efficacy. In light of this, each time \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) was solved, we solved the same instance once more, this time utilizing CPLEX cuts. This 0–1 pattern generation MIP is referred to as \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick+CPLEX}}\) below. That is, \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick+CPLEX}}\) is a strengthened version of \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) via the cutting plane methods implemented in CPLEX.

Difficulties of mining useful information from real-life data are often caused by the presence of a few ‘hard-to-classify’ data. Thus, for a fair comparison, we repeated the whole 3-fold validation procedure 10 times to test the three 0–1 MIP models in 30 unique pattern generation tasks with each dataset in Table 4.

Table 4 Binary classification datasets from [24]

Generally speaking, the relaxation bound of a hard problem at the root node of the branch-and-bound tree is poor. Nevertheless, when an advanced MIP solver like CPLEX is utilized, the root node relaxation values are the least influenced by the arsenal of solution rules and tools as well as branch-and-bound options featured in the solver, hence present a valid measure by which the relative strength of the relaxation models compared can be assessed. Thus, we compare the average root node relaxation gaps of the three MIP pattern generation models to examine their relative tightness in formulation. Table 5 below summarizes the root node relaxation gaps of the three MIP models and provides information on the percentage of improvements made on \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\), the basis for all comparisons, by means of using CPLEX cuts and our new cuts under the sub-columns labelled '†' and '‡', respectively. All numbers in this table and those below are in the format 'average ± 1 standard deviation' of 30 results, followed by the minimum and the maximum values in parentheses.

To assess how the new inequalities enhance the overall efficiency of solution, we compare the CPU time (in ticks) required for solving the three 0–1 MIP models in Table 6 and summarize the number of CPLEX branch-and-bound nodes explored in Table 7. Between CPU time and branch-and-bound nodes, we note that the former is the more credible measure of the efficacy of a method.

Last, Table 8 summarizes information on the number (\(n_\dag \)) and degree (d) of stars found and the number (\(n_\ddag \)) of neighboring stars found, along with their size (q) and the number (\(| \mathscr {U} |\)) of terms/data included in the Q's, for each of the 12 datasets. These numbers can be used to infer the practical utility of the mathematical results of this paper, particularly in multi-term relaxing \(\varphi ^\bullet \). For instance, with information on \(m^+\) and \(\bar{n}_s\), one notes from the last number inside the parentheses for the seis dataset that Theorem 3 simultaneously relaxes up to 20 terms of \(\varphi ^+\), which consists of 113 \((m^+)\) terms, each a product of 35 \((\bar{n}_s)\) factors \((1-x_j)\), and so forth.

For comparison, we recall once more that \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick+CPLEX}}\) and \((\underline{\text{ PG }})^\bullet _{\texttt {stars}}\) are strengthened forms of \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) by means of CPLEX cutting planes and the valid inequalities of this paper, respectively. Thus, one has the following points readily available for comparison and interpretation of numerical results in Tables 5–7:

  • The comparative performance between \((\underline{\text{ PG }})^\bullet _\texttt {mccormick}\) and \((\underline{\text{ PG }})^\bullet _\texttt {mccormick+CPLEX}\) is an indicator of the efficacy of cutting plane methods in solving 0–1 pattern generation instances from the 12 datasets;

  • Furthermore, the above serves as a measure of numerical difficulties associated with LAD pattern generation and solving (PG);

  • The comparative results by \((\underline{\text{ PG }})^\bullet _\texttt {mccormick}\) and \((\underline{\text{ PG }})^\bullet _\texttt {stars}\) illustrate benefits of the new multi-term, polyhedral relaxation scheme of this paper for LAD pattern generation, thus for a class of practical 0–1 MP in (PG); and

  • Last, the comparative results by \((\underline{\text{ PG }})^\bullet _\texttt {mccormick+CPLEX}\) and \((\underline{\text{ PG }})^\bullet _\texttt {stars}\) measure a relative superiority of one method over the other in solving the standard 0–1 MIP equivalent of (PG).

In reference to the above, the numerical results are self-explanatory; thus, we comment only on a few of the more important and less obvious issues below.

Table 5 Root node relaxation gap

To begin with, as noted earlier, data mining difficulties are often associated with the presence of (a few) 'difficult' data. Being real-life data, the 12 datasets used for the comparative experiments in this section contain hard-to-classify data. This is evidenced in Table 6, where the time required for solving \((\underline{\text{ PG }})^\bullet _{\texttt {stars}}\) instances follows a right-skewed distribution for each dataset; namely, the right tail (that is, the difference between the max and average values) is about 2–5 times longer than the left tail regardless of the size of the dataset, despite all 30 \((\underline{\text{ PG }})^\bullet _{\texttt {stars}}\) instances for a dataset featuring the same number of constraints and 0–1 variables. Largely, this explains why n-fold cross validation experiments have been adopted as a standard setting for comparative experiments in the data mining literature, as well as in this section.

Now, recall that \((\underline{\text{ PG }})^\bullet _{\texttt {stars}}\) is built on \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) by replacing the term-wise McCormick envelopes in the latter with a smaller number of stronger inequalities. As practical phenomena can occasionally defy mathematical triumphs, one may not expect \((\underline{\text{ PG }})^\bullet _{\texttt {stars}}\) to outperform \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) on every pattern generation instance of the 30-fold experiments for each of the test datasets. Generally speaking, however, mathematical superiority pervades in practice and translates to benefits. Therefore, in light of the mathematical fact that \((\underline{\text{ PG }})^\bullet _{\texttt {stars}}\) is not only smaller in size but also tighter than its predecessor, the former is much easier to solve, regardless of the size of the dataset. Tables 6 and 7 support this well and clearly demonstrate the benefits of the multi-term relaxation method of this paper.

General cutting plane methods have stood the test of time as effective tools for MIP solution. However, one sees from the comparative results for \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) and \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick+CPLEX}}\) in Tables 5–7 that they do not prove effective at all for the class of practically important problems dealt with in this paper. Specifically, the use of CPLEX cuts improves the root node bounds for \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) only by a very small fraction (refer to Table 5) but requires substantially more CPU time (refer to Table 6) for solving the 0–1 MIP model. Plausibly, the increased CPU time requirements may arise from the added cuts giving rise to larger LP relaxations to be solved. They may also be due to the additional constraints preventing the simple structure of \((\underline{\text{ PG }})^\bullet _\texttt {mccormick}\) (namely, the minimal cover inequalities and the simple upper bounding-type constraints of the McCormick relaxation) from being 'fully' exploited for efficient solution. Simply, one can see from the comparative results for \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick+CPLEX}}\) that \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) is a challenging global optimization problem.

For the same optimization problem, our new cuts reduce the root node gap by a substantial amount and prove to be significantly more effective than CPLEX cuts on average; to see this, compare the numbers in columns '†' and '‡' of Table 5. They also contribute to a faster solution of the 0–1 pattern generation instances (refer to Table 6). As an astute reader will note, this is not a surprise but a direct result of the mathematical discoveries of this paper, which yield \((\underline{\text{ PG }})^\bullet _{\texttt {stars}}\) from \((\underline{\text{ PG }})^\bullet _{\texttt {mccormick}}\) by substituting a smaller number of stronger inequalities for the standard McCormick inequalities in the latter, base model. Furthermore, we note (referring the reader back to Table 2) that fixing the value of one \(y_i\) in a new inequality to 1 can have a cascading effect of fixing the values of the remaining \(y_j\)'s in the same inequality to 0 (although we doubt this useful property is fully exploited by CPLEX's default setting). For example, the information for seis in Table 8 reveals that our valid inequalities via Theorem 3 involve up to 20 \(y_i\)'s, obtained from simultaneously relaxing 20 terms of \(\varphi ^+\), which further yield additional inequalities with up to 19 \(y_i\)'s and a different \(x_k\) via Theorems 5 and 6.

Table 6 CPU time (in ticks) required for solution
Table 7 CPLEX branch and bound nodes required for solution
Table 8 Number of stars and neighboring stars utilized

5 Conclusion

This paper dealt with a multi-term, polyhedral relaxation of \(\varphi ^+\) of (PG) in terms of a smaller number of stronger linear inequalities. Toward this goal, we analyzed a set of \(+\) data and a set of − data on a graph and discovered sufficient conditions and tools for obtaining a tight polyhedral overestimator of \(\varphi ^+\) from individual stars and also from sets of stars. We showed that our inequalities are facet-defining for the 0–1 multilinear polytope associated with the McCormick inequalities that they replace. We further showed that the maximum benefit, in regard to the size as well as the tightness of the relaxation model, is realized when a maximal set of maximal stars is exploited for generating new valid inequalities. With experiments on 12 public data mining datasets, we demonstrated the practical utility of the new results in absolute terms and also in comparison with the cutting plane methods for MIP implemented in CPLEX.

Last, a multi-term, polyhedral convexification/relaxation of highly nonlinear, complex multilinear programs is well known to be difficult. The numerical studies in Sect. 4 demonstrate the benefits of such results with the practical pattern generation MP for LAD. Here, we note that the material of this paper can also be used for other optimization problems/functions, including:

  • more general 0–1 multilinear functions of the form \(\phi ^+(\mathbf {x}) {:=} \sum _{i = 1}^{m} c_i \prod _{j \in N_i} x_j\) with \(N_i \subseteq \mathscr {N}\), where \(c_i \in \mathbb {R}_+\); and

  • linear complementarity relations/constraints; that is, functions in 0–1 vectors \(\mathbf {x}\) and \(\mathbf {y}\) with \(\mathbf {x} \cdot \mathbf {y} = 0\) (e.g., [25]).

Therefore, one may see the numerical results of this paper as a call for more attention from the optimization community to the aforementioned, mathematically classic and practically important research subject in nonlinear optimization.