1 Introduction

The Set Cover problem [6] is a well-known optimization problem consisting in covering a set \( \mathcal {U}={\{e_1,e_2,\dots ,e_m\}}\) of elements (called the universe) using a minimum number of subsets of that universe (called covering sets), taken from a collection \(\mathcal {S}={\{S_1,S_2,\dots ,S_n\}}\). The Set Cover problem can be generalized in many ways. For instance, in the Weighted Set Cover, a cost is assigned to each covering set, and the goal is to cover \(\mathcal {U}\) while minimizing the total cost. Another variant, called Set MultiCover [4], adds a non-negative demand for each element in \(\mathcal {U}\) and requires that each of these elements have to be covered at least as much as requested – in that case, each of the covering sets can be used several times. The latter problem can be further extended to the case where the \(S_i\)s are covering multisets instead of sets: the problem is then called Multiset MultiCover [4]. When the number of times a covering set (resp. multiset) can be chosen is limited, we are in presence of the (Multi)set MultiCover with Multiplicity Constraints [4]. It is also possible to limit the capacity of each covering set. For example, if the covering set \(S_1={\{e_1,e_2,e_3\}}\) is assigned a capacity of 1, it can only cover one of its three elements; however, it can be used several times, e.g. one copy of \(S_1\) can cover \(e_1\), while a second copy can cover \(e_3\); in that case, the associated cost is twice the weight of \(S_1\). The latter problem is called the Capacitated Set Cover problem [1].

In this paper, we introduce and study yet another variant of the Set Cover problem that we call the Exact Subset MultiCover problem (ESM). This problem has biological motivations, that will be briefly explained later. Before that, we formally define ESM, illustrate it on an example, and discuss how it relates to and differs from the above mentioned problems.

figure a

We call covering sets (resp. covering subsets) the elements of \(\mathcal {S}\) (resp. \(\mathcal {S}'\)). For an element \(e_j\in \mathcal {U}\), \(\mathcal {E}_j\) is the set of indices of the covering subsets containing \(e_j\). We say that \(e_j\) is exactly covered if, in the multiset \(({\mathcal {S}',g})\), the sum of the multiplicities of the covering subsets that contain \(e_j\) is equal to \(f(e_j)\). We say that \(e_j\) is covered \(t\ge 1\) times (or simply covered) by a covering subset \(S'_i\) if \(e_j\) belong to \(S'_j\) and \(g(S'_j) = t\). Note that in any solution of ESM, all elements of \(\mathcal {U}\) must be exactly covered.

An example is provided in Fig. 1, with \(\mathcal {U}={\{e_1,e_2,\ldots ,e_6\}}\) (thus \(m=6\)) and the multiplicities in \(\mathcal {U}\) are respectively 3, 10, 4, 6, 8, 3 (e.g. \(f(e_2)=10\)); \(\mathcal {S}={\{S_1,S_2,S_3,S_4\}}\) (thus \(n=4\)) with \(S_1={\{e_1,e_2,e_4,e_5\}}\), \(S_2={\{e_2,e_3,e_5,e_6\}}\), \(S_3={\{e_1,e_2,e_3, e_4\}}\) and \(S_4={\{e_2,e_3,e_4,e_6\}}\).

As can be seen in Fig. 1(right), \(({({\mathcal {U},f}),\mathcal {S}})\) is a Yes-instance for ESM. First, for each \(1\le i\le 4\), we have \(S'_i \subseteq S_i\). Then, by function g (rightmost column), we have that each element \(e_j\in \mathcal {U}\), \(1\le j\le 6\), is exactly covered. For example, \(e_2\) belongs to \(S'_1\), \(S'_3\) and \(S'_4\) (\(\mathcal {E}_2={\{1,3,4\}}\)), and function g applied to these covering subsets respectively returns 3, 4 and 3, for a total of \(10 = f(e_2)\).

Fig. 1.
figure 1

(Left) An instance \(({({\mathcal {U},f}),\mathcal {S}})\) of ESM, where dots in bold show which elements belong to which covering sets, and where the multiplicity function f of each element in \(\mathcal {U}\) is provided in the bottom row. (Right) This instance is a Yes-instance for ESM, and a solution is provided. The last column corresponds to the multiplicity function g of the covering subsets \(S'_i\)s in the multiset \(({\mathcal {S'},g})\).

ESM has similarities to the Set MultiCover problem, but differs from it in two ways: first, in ESM each element \(e_j\) must be exactly covered while an element must be covered at least a certain number of times in Set MultiCover. Second, in ESM, a covering set \(S_i\) is allowed to only cover a subset of its elements (called the covering subset \(S'_i\)). In that sense, ESM has the same flavor as the Capacitated Set Cover problem; however, it differs from it, as in ESM, the size of each covering subset is not constrained. Moreover, when a given covering set \(S_i\) is used several times in ESM, the same covering subset \(S'_i\) must be used every time, a constraint that is not present in Capacitated Set Cover.

The ESM problem is motivated by proteomics, and more precisely by the protein inference problem. The goal is to infer the protein sequences that are present in a biological sample, using the information provided by mass spectrometry (MS/MS). In a nutshell, each protein is cut in smaller pieces called peptides, before entering the mass spectrometer. The mass spectrometer then outputs a series of spectra, each ideally representing a peptide. Once a spectrum is associated to a peptide (through some dedicated algorithm), we obtain a multiset of peptides corresponding to the initial sample, and the aim is to decide which set of proteins (provided from a databank) corresponds with the peptides at hand (see e.g. [3], or Chap. 16 of [5], for a detailed description). In our setting, each element of \(\mathcal {U}\) represents a peptide, f is the multiplicity function of each peptide, while \(\mathcal {S}\) is the set of proteins of the databank. The ESM problem then consists in inferring the proteins from \(\mathcal {S}\) that are present in the sample, together with their abundance, given the information provided by the multiset (\(\mathcal {U}\) ,f) of peptides.

We will also consider a constrained version of ESM, in which we ask that each element \(e_j\) belongs to exactly one covering subset \(S'_{k_j}\) (in which case we say that \(S'_{k_j}\) exactly covers \(e_j\)). We call this version Exclusive Exact Subset MultiCover (E ESM).

figure b

Note that E ESM is a variant of ESM with additional constraints on the solution, hence, any Yes-instance for E ESM is also a Yes-instance for ESM, and any No-instance for ESM is also a No-instance for E ESM.

In the following, it will be convenient to partition the different multiplicities in \(({\mathcal {U},f})\) into groups, where each element having the same multiplicity belongs to the same group. For example, if \(\mathcal {U}={\{e_1,e_2,e_3\}}\) and the multiplicities given by f are resp. 5, 7 and 5, then, we have two groups: \({\{e_1,e_3\}}\) and \({\{e_2\}}\). We call \(\mathsf {n_g}\) the total number of groups for a given multiset \(({\mathcal {U},f})\). It can be seen that, for any Yes-instance for E ESM, in any solution, a covering subset \(S'_i\) can only contain elements belonging to the same group. If not, this means that some \(S'_i\) covers \(g(S'_i)\) times two elements having different multiplicities by f, which implies that at least one of these two elements is not exactly covered, a contradiction. As an illustration, the instance depicted in Fig. 1 is a No-instance for E ESM, simply because there are 5 groups and only 4 covering sets in \(\mathcal {S}\).

In this paper, we also consider a third problem that we call Maximum Exclusive Exact Subset MultiCover (Max-E ESM), which can be seen as the maximization version of E ESM: the goal is to find a maximum cardinality subset \(\mathcal {U}'\) of \(\mathcal {U}\) such that \((({\mathcal {U}',f}),\mathcal {S})\) is a Yes-instance for E ESM.

figure c

We refer to Fig. 1 for an illustration: Max-E ESM will return a subset \(\mathcal {U}'\) of \(\mathcal {U}\) with \(|\mathcal {U}'|=4\). In other words, it is necessary to remove 2 elements from \(\mathcal {U}\) to satisfy E ESM: indeed, (i) group \({\{e_1,e_6\}}\) with multiplicity 3 needs to be exactly covered by two different covering subsets, as no covering set simultaneously contains \(e_1\) and \(e_6\), and (ii) there are 5 groups and 4 covering sets in \(\mathcal {S}\). Moreover, \((({\mathcal {U}',f}),\mathcal {S})\) with \(\mathcal {U}'={\{e_1,e_2,e_3,e_4\}}\) is a Yes-instance for E ESM (simply take \(S'_i\) to exactly cover \(e_i\) for \(1 \le i \le 4\)).

Table 1. Complexity results for ESM, E ESM and Max-E ESM. Parameters m and n respectively denote the number of covering sets in \(\mathcal {S}\) and the number of elements in \(\mathcal {U}\).

The present paper aims at studying ESM, E ESM and Max-E ESM from an algorithmic viewpoint (see resp. Sects. 23 and 4). Most of our results are summarized in Table 1. Due to space constraints, some of the proofs are omitted. They will be available in the journal version of this paper.

2 The Exact Subset MultiCover Problem

In this section, we focus on the ESM problem. We study its computational complexity, first depending on the number \(\mathsf {n_g}\) of groups in the instance, which allows us to draw the tractability border based on that value (Proposition 1 and Theorem 1); then, depending on the contents of \(\mathcal {S}\) (Theorem 2). We then prove that ESM is fixed-parameterized tractable with respect to \(m=|\mathcal {U}|\) (Proposition 2) and with respect to \(n=|\mathcal {S}|\) (Proposition 3).

Proposition 1

ESM is in P when \(\mathsf {n_g}=1\) (i.e., when f is a constant function).

Theorem 1

ESM is NP-hard even if (i) every covering set \(S_i\) is of cardinality at most 3; (ii) every element of \(\mathcal {U}\) is present in at most 3 covering sets; and (iii) the number \(\mathsf {n_g}\) of groups is equal to 2.

Proof

The proof is by reduction from 3-SAT-3, a constrained variant of SAT in which every clause contains at most 3 literals, and every variable appears at most 3 times in the formula. It has been shown that 3-SAT-3 is NP-hard [7].

Take any instance \(\varPhi = C_1\wedge C_2\wedge \ldots \wedge C_m\) of 3-SAT-3, where each clause \(C_j\) of \(\varPhi \), \(1\le j\le m\), is built from boolean variables taken from \(X={\{x_1,x_2,\ldots ,x_n\}}\). The instance of ESM we build from \(\varPhi \) is as follows: first, the set \(\mathcal {U}\) is the union of two sets \(\mathcal {U}_{cl}\) and \(\mathcal {U}_{neg}\). More precisely, \(\mathcal {U}_{cl}={\{e_1,e_2,\ldots ,e_m\}}\), i.e. \(\mathcal {U}_{cl}\) contains as many elements as there are clauses in \(\varPhi \). Moreover, \(f(e_i)=1\) \(\forall e_i \in \mathcal {U}_{cl}\). The second set \(\mathcal {U}_{neg}\) contains elements of the form \(e_{i,j}\), for all i, j such that clause \(C_j\) contains the negative literal \(\overline{x_i}\). For all elements \(e_{i,j}\) from \(\mathcal {U}_{neg}\), we set \(f(e_{i,j})=3\).

Now, the collection \(\mathcal {S}\) is also the union of two sets, namely \(\mathcal {S}_{var}\cup \mathcal {S}_{neg}\). The first set, \(\mathcal {S}_{var}={\{S_1,S_2,\ldots ,S_n\}}\) contains as many covering sets as there are variables in X. The second set, \(\mathcal {S}_{neg}\), contains covering sets denoted \(S_{i,j}\), for all i and j such that clause \(C_j\) contains the negative literal \(\overline{x_i}\) (if an element \(e_{i,j}\) belongs to \(\mathcal {U}_{neg}\), then a covering set \(S_{i,j}\) belongs to \(\mathcal {S}_{neg}\)). It now remains to describe the contents of each covering set \(S_i\) and \(S_{i,j}\): for any \(1 \le j \le m\), if clause \(C_j\) contains the positive literal \(x_i\) (resp. the negative literal \(\overline{x_i}\)), then \(e_j\) belongs to \(S_i\) (resp. to \(S_{i,j}\)). Moreover, if clause \(C_j\) contains the negative literal \(\overline{x_i}\), then \(e_{i,j}\) belongs to both \(S_i\) and \(S_{i,j}\) – see Fig. 2 for an illustration.

The ESM instance we have built satisfies the constraints listed in the statement of Theorem 1. First, each element belongs to at most 3 covering sets of \(\mathcal {S}\): each element \(e_j\) appears in at most 3 covering sets, since \(e_j\) represents a clause \(C_j\), which is of size at most 3, while each element \(e_{i,j}\) appears in exactly two covering sets, namely \(S_i\) and \(S_{i,j}\). Second, any of the covering sets we build contains at most 3 elements: a covering set \(S_i\) contains one element per occurrence of variable \(x_i\) in \(\varPhi \) (thus at most 3 by definition of 3-SAT-3), while a covering set \(S_{i,j}\) contains exactly 2 elements by construction. Finally, by construction, each element appears 1 time or 3 times in the multiset \(({\mathcal {U},f})\), thus \(\mathsf {n_g}=2\).

Fig. 2.
figure 2

Illustration of our reduction from 3-SAT-3, with \(\varPhi =(x_1 \vee x_2 \vee x_3) \wedge (x_1 \vee x_4 \vee \overline{x_5}) \wedge (\overline{x_2} \vee x_3 \vee x_5) \wedge (x_3 \vee \overline{x_4})\), \(n=5\) and \(m=4\). Clause \(C_2= (x_1 \vee x_4 \vee \overline{x_5})\) corresponds to element \(e_2\) (with \(f(e_2)=1\)). Element \(e_2\) is present in covering sets \(S_1\) and \(S_4\) since \(C_2\) contains the two positive literals \(x_1\) and \(x_4\). Clause \(C_2\) contains one negative literal, \(\overline{x_5}\), hence the existence of element \(e_{5,2}\) in covering set \(S_{5,2}\) (with \(f(e_{5,2})=3\)).

We now show that \(\varPhi \) is a Yes-instance for 3-SAT-3 iff \(({({\mathcal {U},f}),\mathcal {S}})\) is a Yes-instance for ESM.

(\(\Rightarrow \)) Let \(\varPhi \) be a Yes-instance of 3-SAT-3; thus every clause in \(\varPhi \) is satisfied. For each variable \(x_i\in X\), if \(x_i=\texttt {True}\), then we set \(S'_i = S_i \cap \mathcal {U}_{cl}\) with \(g(S'_i)=1\). If, on the contrary, \(x_i=\texttt {False}\), then we set \(S'_i = S_i \cap \mathcal {U}_{neg}\) with \(g(S'_i)=3\). In case an element \(e_j\) from \(\mathcal {U}_{cl}\) is simultaneously present in several distinct \(S'_i\)s, we remove \(e_j\) from all \(S'_i\)s containing it except one chosen arbitrarily (the covering subsets \(S'_i\)s from which \(e_j\) has been removed remain subsets of \(S_i\)). Now, if for some j, \(e_j\) is not present in any \(S'_i\), this means there exists an \(i'\) such that (i) \(e_j\) is in \(S_{i',j}\), and (ii) \(\overline{x_{i'}}=\texttt {True}\), since \(C_j\) is satisfied. In that case, we set \(S'_{i',j} = {\{e_j\}}\) and \(g(S'_{i',j})=1\). Finally, if, for some i and j, \(e_{i,j}\) is not present in \(S_i\), we set \(S'_{i,j} = {\{e_{i,j}\}}\) and \(g(S'_{i,j}) = 3\). Having done that, we have exactly covered every element from \(\mathcal {U}\). It just remains to show that we introduced no inconsistency on the way. First, for a fixed i, \(g(S'_i)\) cannot be equal to 1 and 3 at the same time, by construction. It thus remains to show that a given \(g(S'_{i,j})\) cannot be equal to 1 and 3 at the same time. For this, recall that \(e_{i,j} \in S'_{i,j}\) (and \(g(S'_{i,j})=3\)) only when \(e_{i,j} \notin S'_i\), i.e. when \(\overline{x_i}\) is present in \(C_j\) with \(x_i=\texttt {True}\). On the other hand, as mentioned above, \(e_j \in S'_{i,j}\) (and \(g(S'_{i,j})=1\)) only when \(x_i=\texttt {False}\). Hence, no inconsistency occurs, and our construction is thus a Yes-instance for ESM.

(\(\Leftarrow \)) Suppose the constructed instance \(({({\mathcal {U},f}),\mathcal {S}})\) of ESM is a Yes-instance. For each covering set \(S_i\) in \(\mathcal {S}_{var}\), if \(f(S'_i) \le 1\) or if \(S'_i = \emptyset \), then we set \(x_i\) to True. Otherwise (i.e. if \(S'_i\) contains only and at least one element of \(\mathcal {U}_{neg}\) and if \(f(S'_i) \ge 2\)), we set \(x_i\) to False. Note first that an element \(e_j\) from \(\mathcal {U}_{cl}\) cannot be simultaneously covered by several distinct covering subsets in \(\mathcal {S}'\), since \(f(e_j)=1\). Let us now show that the above described assignment satisfies the formula \(\varPhi \) from the initial 3-SAT-3 problem. Consider an element \(e_j\): if it is covered by a subset \(S'_i\), this is necessarily with \(g(S'_i)=1\), thus \(x_i=\texttt {True}\), which satisfies \(C_j\) since \(C_j\) contains literal \(x_i\) by construction. Suppose now \(e_j\) is covered by a subset \(S'_{i,j}\), thus with \(g(S'_{i,j})=1\). Then, we know that \(e_{i,j}\) is either covered one time by \(S'_{i,j}\) (if \(e_{i,j} \in S'_{i,j}\)), or not covered by it (if \(e_{i,j} \notin S'_{i,j}\)). In any case, this proves that \(e_{i,j}\) must belong to \(S'_i\) with \(g(S'_i) \in {\{2,3\}}\), since \(e_{i,j}\) is only present in \(S_{i,j}\) and \(S_i\). Hence, we conclude that \(x_i=\texttt {False}\) (since \(g(S'_i) \ge 2\), \(e_{i,j} \in S'_i\) and \(S'_i\) cannot contain any element from \(\mathcal {U}_{cl}\) because \(g(S'_i) \ge 2\)), which satisfies \(C_j\) since the existence of \(e_{i,j}\) implies the presence of literal \(\overline{x_i}\) in \(C_j\).

Corollary 1

Unless ETH fails, there exists a \(\delta > 0\) such that ESM cannot be solved in \(O(2^{\delta \cdot \max (n,m)})\).

Theorem 2

ESM is NP-hard, even if every covering set \(S_i\) is of cardinality 2.

Proof

The proof is by reduction from the NP-hard Partition problem [2, 6]: given a multiset \(\mathcal {Q}={\{q_1,q_2,\ldots ,q_n\}}\) of positive integers such that \(C=\sum _{i=1}^n q_i\) is even, is it possible to partition \(\mathcal {Q}\) in two subsets \(\mathcal {Q}_1\) and \(\mathcal {Q}_2\) such that \(\sum _{q_i\in \mathcal {Q}_1}=\sum _{q_i\in \mathcal {Q}_2}=\frac{C}{2}\) ? Let \(\mathcal {Q}={\{q_1,q_2\ldots ,q_n\}}\) be any instance of Partition. We construct an instance of ESM as follows: first, let \(\mathcal {U}={\{e_1,e_2,\ldots ,e_n, e_{n+1}\}}\) with \(f(e_i)=q_i\) for every \(\ 1 \le i \le n\), and \(f(e_{n+1})=\frac{C}{2}\); let also \(\mathcal {S}={\{S_1,S_2\ldots S_n\}}\) where \(S_i={\{e_i,e_{n+1}\}},\ 1 \le i \le n\). Clearly, in this instance of ESM, every covering set \(S_i\) contains exactly two elements. We now show that \(\mathcal {Q}\) is a Yes-instance for Partition iff \(({({\mathcal {U},f}),\mathcal {S}})\) is a Yes-instance for ESM.

(\(\Rightarrow \)) Suppose \(\mathcal {Q}\) is a Yes-instance for Partition. Thus, there exists a partition \({\{\mathcal {Q}_1,\mathcal {Q}_2\}}\) of \(\mathcal {Q}\) such that \(\sum _{q_i\in \mathcal {Q}_1}=\sum _{q_i\in \mathcal {Q}_2}=\frac{C}{2}\). For each \(q_i\in \mathcal {Q}_1\), we let \(S'_i={\{e_i,e_{n+1}\}}\) and \(g(S'_i) = q_i\). For each \(q_i\in \mathcal {Q}_2\), we let \(S'_i={\{e_i\}}\) and \(g(S'_i)=q_i\). We can see that each element \(e_i\), \(1 \le i \le n\), is exactly covered by \(S'_i\), whereas \(e_{n+1}\) is covered by some covering subsets of \(\mathcal {S}'\), namely those whose indices correspond to the indices of the \(q_i\)s that are in \(\mathcal {Q}_1\). Since \(\sum _{q_i \in \mathcal {Q}_1} q_i=\frac{C}{2}\), \(e_{n+1}\) is altogether exactly covered. Thus the solution we provide is valid, and \(({({\mathcal {U},f}),\mathcal {S}})\) is a Yes-instance for ESM.

(\(\Leftarrow \)) Suppose \(({({\mathcal {U},f}),\mathcal {S}})\) is a Yes-instance for ESM. Hence, every element is exactly covered, and in particular elements \({\{e_1,e_2,\ldots ,e_n\}}\). Since each \(e_i\), \(1 \le i \le n\), appears only in \(S_i\), each \(S'_i\) contains \(e_i\) with \(g(S'_i) = q_i\). This implies that some covering subsets of \(\mathcal {S}'\) collectively exactly cover \(e_{n+1}\). Since \(g(S'_i)=q_i\) for all \(1 \le i \le n\), we conclude that \(\frac{C}{2}\) is the sum of the \(g(S'_{i'})\)s for which \(S'_{i'}\) contains \(e_{n+1}\). Hence, a solution for Partition can be inferred from the observed solution of ESM: if a covering subset \(S'_i\) contains \(e_{n+1}\), assign \(q_i\) to \(\mathcal {Q}_1\), otherwise assign \(q_i\) to \(\mathcal {Q}_2\). We have that \({\{\mathcal {Q}_1,\mathcal {Q}_2\}}\) is a partition of \(\mathcal {Q}\). Moreover, the above argument shows that \(\sum _{q_i \in \mathcal {Q}_1} q_i=\frac{C}{2}\), and thus \(\sum _{q_i \in \mathcal {Q}_2} q_i=\frac{C}{2}\) as well.

The two following results show that ESM is FPT with respect to m, the size of the universe \(\mathcal {U}\); and FPT with respect to n, the size of the collection \(\mathcal {S}\).

Proposition 2

ESM is FPT with respect to parameter \(m=|\mathcal {U}|\).

Proof

First, given that no two covering sets of \(\mathcal {S}\) are identical, and given that all covering sets of \(\mathcal {S}\) are built from \(\mathcal {U}\), we have that \(|\mathcal {S}| \le 2^{|\mathcal {U}|}\), i.e. \(n \le 2^{m}\). Then, consider the following algorithm : (i) generate all possible \(\mathcal {S}' = {\{S'_1,S'_2,\dots ,S'_n\}}\) such that \(S'_i \subseteq S_i\ \forall 1 \le i \le n\); then, for each generated \(\mathcal {S}'\), (ii) solve the resulting problem (see below). Note that it may not be necessary to generate all possible \(\mathcal {S}'\), as the algorithm stops when one solution is found. In the worst case, the size of \(\mathcal {S'}\) is \(O(2^m)\) given that \(|\mathcal {S}'|=|\mathcal {S}|\). Moreover, for each covering subset \(S'_i\), we have \(S'_i \in \mathscr {P}(S_i)\) where \(\mathscr {P}(S_i)\) is the power set of \(S_i\). Since \(|S_i| \le m\), then, \(|\mathscr {P}(S_i)| \le 2^m\), and the number of possibilities in (i) does not exceed \(2^{m2^m}\).

Given any \(\mathcal {S}'\), solving the resulting problem as mentioned in (ii) above consists in finding the multiplicity function g. This problem can be modeled as a system of linear equations where the variables are the \(g(S'_i)\)s, \(1 \le i \le n\). For example, if \(e_1\) belongs to \(S'_1,S'_3\) and \(S'_4\), we create the following equation: \(f(e_1) = g(S'_1) + g(S'_3) + g(S'_4)\). The number of variables (resp. constraints) of such a system is at most \(2^m\) (resp. m). This system is solvable in a time that only depends on the number of variables and equations. Thus the total complexity of our algorithm depends only on m, which proves the result.

Proposition 3

ESM is FPT with respect to parameter \(n=|\mathcal {S}|\).

3 The Exclusive Exact Subset MultiCover Problem

In this section, we focus on the E ESM problem, in which every element of \(\mathcal {U}\) must be exactly covered by a single covering subset of \(\mathcal {S}'\). We start with two positive results, before discussing NP-hardness and fixed-parameterized tractability.

Proposition 4

EESM is in P (i) when \(\mathsf {n_g}=1\); (ii) when \(\mathsf {n_g}=|\mathcal {U}|\) (i.e., each element e of \(\mathcal {U}\) has distinct multiplicity); and (iii) when \(\mathsf {n_g}=|\mathcal {S}|\).

Theorem 3

EESM is in P when every element of \(\mathcal {U}\) belongs to at most two covering sets of \(\mathcal {S}\).

Proof

First note that we can always assume that each element belongs to exactly two covering sets of \(\mathcal {S}\). Indeed, when an element \(e_j\) belongs to exactly one covering set \(S_i\), then necessarily, in any Yes-instance, the chosen \(S'_i\) must exactly cover \(e_j\), and thus the initial instance can be reduced, by removing \(S_i\) and each element \(e\in S_i\) such that \(f(e)=f(e_j)\).

The proof is by equivalence with 2-SAT, which is know to be in P. Given any E ESM instance \(({({\mathcal {U},f}),\mathcal {S}})\) such that every element belongs to two covering sets, we construct an instance \(\varPhi \) of 2-SAT on a set X of variables defined as follows: for each covering set \(S_i \in \mathcal {S}\) and for each element \(e_j \in S_i\), add the boolean variable \(x_{i,j}\) to X. Now we show how to build a 2-SAT formula \(\varPhi \) using variables from X. We build two categories of clauses. The first one, which we call assignment clauses, is as follows: for each element \(e_j\) (thus present in two covering sets, say \(S_{i_1}\) and \(S_{i_2}\)), we create the two following clauses: \(C_j^1 = (x_{i_1,j} \vee x_{i_2,j})\) and \(C_j^2 = (\overline{x_{i_1,j}} \vee \overline{x_{i_2,j}})\). Note that the conjunction of \(C_j^1\) and \(C_j^2\) is equivalent to the XOR expression \((x_{i_1,j} \oplus x_{i_2,j})\): this encodes the fact that every element \(e_j\) must be exactly covered by only one covering subset (either \(S'_{i_1}\) or \(S'_{i_2}\)). The second category, which we call consistency clauses, will encode the fact that a given covering subset \(S'_i\) cannot simultaneously contains two elements \(e_{j_1}\) and \(e_{j_2}\) such that \(f(e_{j_1}) \ne f(e_{j_2})\). Thus, for each covering set \(S_i\), and for each pair of elements \(e_{j_1},e_{j_2}\in S_i\) such that \(f(e_{j_1}) \ne f(e_{j_2})\), we create the following clause: \((\overline{x_{i,j_1}} \vee \overline{x_{i,j_2}})\). Finally, the 2-SAT formula \(\varPhi \) we construct is the conjunction of all assignment clauses and consistency clauses.

It can be seen that \(({({\mathcal {U},f}),\mathcal {S}})\) is a Yes-instance for E ESM iff \(\varPhi \) is a Yes-instance for 2-SAT. Indeed, suppose \(({({\mathcal {U},f}),\mathcal {S}})\) is a Yes-instance for E ESM and observe a given solution: for each element \(e_j\) exactly covered by (resp. not in) \(S'_i\), set \(x_{i,j}\) to True (resp. to False). In that case, all assignment clauses are satisfied, as any element is exactly covered by one covering subset in \(\mathcal {S}'\). Moreover, consistency clauses are also satisfied, since no two elements with distinct multiplicities in \(({\mathcal {U},f})\) can belong to the same covering subset in a solution to E ESM. Now, suppose \(\varPhi \) is a Yes-instance for 2-SAT and observe a given solution:  if \(x_{i,j}=\texttt{True}\), then, covering subset \(S'_i\) exactly covers element \(e_j\) (with \(e_j \in S'_i\) and \(g(S'_i)=f(e_j)\)), otherwise, it does not contain it (\(e_j \notin S'_i\)). This leads to a positive solution for E ESM since assignment clauses ensure that each element is exactly covered by exactly one subset, while consistency clauses ensure that a covering subset does not contain elements having different multiplicities.

Although we provided in Proposition 4 and Theorem 3 classes of instances for which E ESM is in P, as shown in the following theorem, the problem becomes NP-hard, even if the input instance is constrained. Note that Theorems 3 and 4 collectively draw the border between tractable and intractable instances, when considering parameter “number of covering sets to which an element belongs".

Theorem 4

EESM is NP-hard even if (i) every covering set \(S_i\) is of cardinality at most 3; (ii) every element of \(\mathcal {U}\) is present in at most 3 covering sets of \(\mathcal {S}\); and (iii) the number \(\mathsf {n_g}\) of groups is equal to 2.

Proof

The proof is adapted from proof of Theorem 1, and will only be briefly described here. As in proof of Theorem 1, we reduce from 3-SAT-3, and the reduction is very similar, the only difference being that \(f(e_j)\) and \(f(e_{i,j})\) do not need to be exactly 1 and 3, but just need to be different.

The forward direction is the same as in proof of Theorem 1, since the constructed solution for ESM appears to be a solution for E ESM as well. The reverse direction is based on the same arguments, but appears to be simpler, due to the specificity of E ESM, namely, a covering subset of \(\mathcal {S}'\) cannot contain two elements with different multiplicities.

Because the above theorem is proved similarly as in Theorem 1, the same result as Corollary 1 also applies to E ESM.

Corollary 2

Unless ETH fails, there exists a \(\delta > 0\) such that EESM cannot be solved \(O(2^{\delta \cdot \max (n,m)})\).

Since EESM is NP-hard, it is natural to ask for the existence of moderate exponential-time algorithms. We start with the following proposition – recall that \(n=|\mathcal {S}|\) is the number of covering sets, and \(\mathsf {n_g}\) is the number of groups.

Proposition 5

EESM is in XP parameterized by \(\ell =n-\mathsf {n_g}\).

It can also be easily seen that E ESM is FPT in \(m=|\mathcal {U}|\). Indeed, since \(\mathcal {S}\) does not contain twice the same subsets of \(\mathcal {U}\), we have \(|\mathcal {S}|\le 2^m\). Hence, only the values provided by f do not depend on m. However, by definition of E ESM, these values cannot be split, and thus, only the presence/absence of an element in the covering subsets of \(\mathcal {S}'\) needs to be inferred. Hence, a solution for E ESM can be computed with a time complexity that only depends on m, and thus, E ESM is FPT in m. We also have the following result.

Proposition 6

EESM is FPT with respect to parameter \(n=|\mathcal {S}|\).

Proof

We propose an algorithm performing a search in a bounded tree T of arity at most n and of height at most n. The way the algorithm works is the following: at each step, consider a so far uncovered element e, and guess, among the remaining covering subsets \(S'_i\)s for which \(S_i\) contains e, which one exactly covers e. Moreover, when a covering subset \(S'_i\) is assumed to exactly cover e, we enforce all elements in the same group as e which also belong to \(S_i\) to be also exactly covered by \(S'_i\). Then, we remove all exactly covered elements, as well as \(S_i\), from the instance. The algorithm stops when an uncovered element cannot be exactly covered, or when all elements have been exactly covered. In the latter case, we have a Yes-instance whose solution can be computed by backtracking. The fact that the former case always corresponds to a No-instance derives from the following claim: any (positive) solution to E ESM can be modified into a solution found by our algorithm. Indeed, if a solution \(\mathcal {S}ol\) to E ESM is not found by our algorithm, then necessarily, in \(\mathcal {S}ol\), there exists a group in which two elements, say \(e_1\) and \(e_2\), are exactly covered by two different covering subsets, say \(S'_1\) and \(S'_2\). Moreover, \(e_1\) and \(e_2\) are both present either in \(S'_1\), or in \(S'_2\), or both (otherwise our algorithm would have “tried" the configuration proposed by \(\mathcal {S}ol\)). Consequently, we can modify \(\mathcal {S}ol\) so that both \(e_1\) and \(e_2\) are exactly covered by the same covering subset. It suffices to iterate this argument to prove the above claim, and this also proves that our algorithm is correct.

Now, since each new node in T corresponds to a new instance in which at least one covering set of \(\mathcal {S}\) has been discarded, T is of height at most n. Moreover, the number of guesses at each step, and thus the arity of T, also never exceeds n. Altogether, the size of T is thus in \(O(n^n)\). Given that the time needed to create a new node in T is in O(m), the overall complexity of our algorithm is in \(O(mn^{n})\), and the result follows.

4 The Max-EESM Problem

Clearly, Max-E ESM is NP-hard under the same conditions as E ESM (see Theorem 4), since deciding whether \(\mathcal {U}'=\mathcal {U}\) is equivalent to solving E ESM. Also, Max-E ESM is FPT parameterized by \(m=|\mathcal {U}|\), for the same reasons as the ones developed in Sect. 3, since all computations depend only on m. We now show in Theorem 5 that a 2-approximation algorithm \(\mathcal {A}\) exists for Max-E ESM, and prove in Proposition 7 that the ratio 2 obtained by \(\mathcal {A}\) is asymptotically tight.

Theorem 5

Max-EESM is 2-approximable.

Proof

The algorithm \(\mathcal {A}\) we propose is greedy, and works as follows: at each iteration, choose among the remaining covering sets in \(\mathcal {S}\) the one, say \(S_i\), that contains the most elements of the same group, that we name \({e_{k_1},e_{k_2},\ldots ,e_{k_j}}\). Hence, all these elements have the same multiplicity, say q. Set \(S'_i={\{e_{k_1},e_{k_2},\ldots ,e_{k_j}\}}\), \(g(S'_i) = q\), remove \(S_i\) as well as \({e_{k_1},e_{k_2},\ldots ,e_{k_j}}\) from the instance and iterate. Algorithm \(\mathcal {A}\) stops when all elements have been exactly covered, or when no element can be further exactly covered. For simplicity, we assume that the number of iterations of \(\mathcal {A}\) is exactly n, e.g. by allowing “empty" iterations during which a covering set \(S_i\) is selected with \(S'_i=\emptyset \). We now introduce some notations: let \(Sol^*_i\) (resp. \(Sol_i\)) the set of elements that is exactly covered by \(S'_i\) in an optimal solution (resp. by \(\mathcal {A}\)), and by \(x^*_i\) (resp. \(x_i\)) the cardinality of \(Sol^*_i\) (resp. \(Sol_i\)). For any \(1\le i<i'\le n\), we denote by \(\varDelta _{i}^{i'} = Sol_{i} \cap Sol^*_{i'}\) the set of elements that are exactly covered both by covering subset \(S'_i\) (by \(\mathcal {A}\)) and by covering subset \(S'_{i'}\) (in an optimal solution); we also let \(\delta _{i}^{i'} = |\varDelta _{i}^{i'}|\). Finally, for any \(1\le i\le n\), we denote by \(\varDelta _i = Sol_i \cap (\bigcup _{i'=i+1}^n Sol^*_{i'})= \bigcup _{i'=i+1}^n \varDelta _{i}^{i'}\) the set of elements that are exactly covered both by covering subset \(S'_i\) by \(\mathcal {A}\) and any covering subset \(S'_{i'}\) with \(i' > i\) in an optimal solution. We finally let \(\delta _i= |\varDelta _i|\).

Let \(N^*\) be the number of exactly covered elements in an optimal solution to Max-E ESM, and \(N_{\mathcal {A}}\) the number of exactly covered elements by \(\mathcal {A}\). Clearly, \(N^* = \sum _{i=1}^n x^*_i\) and \(N_{\mathcal {A}}=\sum _{i=1}^n x_{i}\). Our goal is to show that \(N_{\mathcal {A}}\ge \frac{N^*}{2}\), or otherwise stated \(2\sum _{i=1}^n x_{i} - \sum _{i=1}^n x^*_i \ge 0\) (1). First note that for any \(i'\ne i''\), we know that \(Sol^*_{i'} \cap Sol^*_{i''}=\emptyset \), since in E ESM, an element can only belong to a single covering subset of \(\mathcal {S'}\). Thus, for any i and \(i' \ne i''\) with \(\min ({i',i''})>i\), we have \(\varDelta _{i}^{i'} \cap \varDelta _{i}^{i''} = \emptyset \), and consequently, for any \(1\le i\le n\), \(|\bigcup _{i'=i+1}^n \varDelta _{i}^{i'}| = \sum _{i'=i+1}^n \delta _{i}^{i'}\), which in turn gives: \(\delta _{i} = \sum _{i'=i+1}^n \delta _{i}^{i'} \,\,\,\forall \, 1\le i\le n\) (2). Coming back to inequality (1), it is equivalent to: \(2\sum _{i=1}^n x_{i} - \sum _{i=1}^n x^*_i + (\sum _{i=1}^n \delta _{i} - \sum _{i=1}^n \delta _{i}) \ge 0\) (3) which, by equality (2), can also be rewritten as: \(2\sum _{i=1}^n x_{i} - \sum _{i=1}^n x^*_i + (\sum _{i=1}^n \sum _{i'=i+1}^n \delta _{i}^{i'} - \sum _{i=1}^n \delta _{i}) \ge 0\) (4). The left side of the above inequality can be written as \(\mathcal {I}_1+\mathcal {I}_2\), where \(\mathcal {I}_1= \sum _{i=1}^n x_{i} - \sum _{i=1}^n \delta _{i}\) and \(\mathcal {I}_2= \sum _{i=1}^n x_{i} + \sum _{i=1}^n \sum _{i'=i+1}^n \delta _{i}^{i'} - \sum _{i=1}^n x^*_i\). Our goal is thus to show that \(\mathcal {I}_1+\mathcal {I}_2\ge 0\). For this, we will separately show that \(\mathcal {I}_1\ge 0\) and \(\mathcal {I}_2\ge 0\), which will allow us to conclude. First, it can be easily seen that \(\mathcal {I}_1\ge 0\), by noticing that \(x_i \ge \delta _i\) for any \(1\le i\le n\). Indeed, by definition, for any \(1\le i\le n\), \(\delta _{i} = |Sol_i \cap (\bigcup _{i'=i+1}^nSol^*_{i'})|\), and thus \(\delta _{i} \le |Sol_i|\). It suffices to recall that \(x_i=|Sol_i|\) by definition to conclude.

Now let us show that \(\mathcal {I}_2\ge 0\). Recall that, at any iteration \(1\le i\le n\), algorithm \(\mathcal {A}\) chooses a covering subset \(S'_i\) to exactly cover all elements belonging to the most represented group. Thus, we conclude that \(Sol_i\) contains at least as many elements as \(Sol^*_i\), minus those which may have already been exactly covered by a covering subset \(S'_{i'}\), \(i'<i\), in a previous iteration of \(\mathcal {A}\). Hence, \( |Sol_i| \ge |Sol^*_i/ \bigcup _{i'=1}^{i-1} Sol_{i'}| \,\,\, \forall \, 1\le i\le n\) (5). Since for any two sets A and B, we have \(|A/B|=|A|-|A\cap B|\), we know that \(|Sol^*_i/ \bigcup _{i'=1}^{i-1} Sol_{i'}|= |Sol^*_i| - |Sol^*_i \cap \bigcup _{i'=1}^{i-1} Sol_{i'}|\). Since \(x_i=|Sol_i|\) and \(x^*_i=|Sol^*_i|\), from (5) we have: \(x_i \ge x^*_i - |Sol^*_i \cap \bigcup _{i'=1}^{i-1} Sol_{i'}| \,\,\, \forall \, 1\le i\le n\) (6). Given that \(|Sol^*_i \cap \bigcup _{i'=1}^{i-1} Sol_{i'}|= | \bigcup _{i'=1}^{i-1} (Sol^*_{i} \cap Sol_{i'})|\), we can rewrite (6) as follows: \(x_{i} \ge x^*_i - | \bigcup _{i'=1}^{i-1} (Sol^*_i \cap Sol_{i'})| \,\,\, \forall \, 1\le i\le n\) (7). Recall that, by definition, \(\bigcup _{i'=1}^{i-1} (Sol^*_i \cap Sol_{i'}) = \bigcup _{i'=1}^{i-1} \varDelta _{i'}^{i}\). We also know that all \(\varDelta _{i'}^{i}\) are pairwise disjoint, again by definition; and since \(\delta ^i_{i'}=|\varDelta _{i'}^{i}|\), inequality (7)  becomes: \(x_i \ge x^*_i - \sum _{i'=1}^{i-1} \delta _{i'}^{i}\) (8). This inequality is true for any \(1\le i\le n\). Thus, if we sum it for every \(1\le i\le n\), we obtain: \(\sum _{i=1}^{n} x_i \ge \sum _{i=1}^{n} x^*_i - \sum _{i=1}^{n} \sum _{i'=1}^{i-1} \delta _{i'}^{i}\), which we can rewrite: \(\sum _{i=1}^{n} x_i + \sum _{i=1}^{n} \sum _{i'=1}^{i-1} \delta _{i'}^{i} - \sum _{i=1}^{n} x^*_i \ge 0\) (9). We are now ready to conclude, as it can be seen that the left term of (9) is exactly \(\mathcal {I}_2\). For this, it suffices to note that \(\sum _{i=1}^{n} \sum _{i'=1}^{i-1} \delta _{i'}^{i}\) in (9) can also be expressed as follows: \(\sum _{i=1}^{n} \sum _{j=i+1}^n \delta _i^j\). Altogether, this proves \(\mathcal {I}_2\ge 0\). Since we proved \(\mathcal {I}_1\ge 0\), we have \(\mathcal {I}_1+\mathcal {I}_2\ge 0\), which shows that \(N_{\mathcal {A}}\ge \frac{N^*}{2}\).

Proposition 7

The approximation ratio 2 for \(\mathcal {A}\) is tight.

5 Conclusion

In this paper, we have introduced the ESM problem and two of its variants, initially motivated by protein inference in mass spectrometry. For these problems, we have provided several algorithmic results, including computational complexity, parameterized complexity and approximation. Several questions remain open, for instance the following: is ESM strongly NP-hard? What is the complexity of ESM when each multiplicity in \(\mathcal {U}\) is unique? Can the 2-approximation ratio of Max-E ESM be improved, and which inapproximability ratio exists?