1 Introduction

Results obtained from a dataset through the Knowledge Discovery from Data (KDD) process should be statistically validated to ensure that they are not just due to the randomness in the Data-Generating Process (DGP) (Hämäläinen and Webb 2019; Pellegrina et al. 2019; Zimmermann 2014): the goal of the analysis is to gain new knowledge about the DGP through the observed dataset, rather than knowledge about the dataset itself.

A rigorous validation approach subjects the results to statistical hypothesis tests (Lehmann and Romano 2022): results that pass the tests are deemed statistically significant, as they appear to give new information on the DGP.

The significance of the results is assessed against a null model \((\mathcal {Z}, \pi )\), where the null set \(\mathcal {Z}\) is the collection of datasets that the DGP may generate, which are assumed to share some characteristics with the observed dataset (e.g., size, frequency of items, number of simple patterns, ...), and \(\pi\) is a user-specified probability distribution over \(\mathcal {Z}\). The null model captures assumed or existing knowledge about the DGP. Results that are deemed significant under an appropriate null model constitute new knowledge about the DGP.

A null model is partially independent of the task whose results one wants to validate: it models the generation of datasets, not directly of results, but it is used to evaluate the results of the task, so it must be representative of the task. The choice of the null model must therefore be deliberate and informed, as the meaning of “significant” depends on it: results deemed significant under one null model cannot in general be compared to those deemed significant under a different null model. “All models are wrong, but some are useful” (George E. P. Box), and some null models may be more appropriate than others for testing the significance of the results of a task, because they more closely represent the settings of the task. Many null models should be available, capturing different properties of the observed dataset, and users must be informed of their differences, so they can choose the one most appropriate for their needs (Ferkingstad et al. 2015). In this work, we present null models that, we argue, are more appropriate for many data mining tasks on transactional datasets, thus expanding the “library” of models available to practitioners.


Many hypothesis tests are based on resampling (Westfall and Young 1993; Lehmann and Romano 2022): they analyze multiple datasets drawn from the null model in order to approximate the distribution of the test statistic, and then compare the observed value of the statistic against such distribution. Thus, computationally-efficient procedures to sample from the null model distribution are necessary for statistically validating KDD results.

Contributions


We study the problem of evaluating the significance of results from binary transactional and sequence datasets, using resampling-based hypothesis tests.

  • We introduce a novel class of null models for these datasets. Our models are Row-Order Agnostic (ROA), i.e., they do not consider the order of the rows (i.e., transactions or sequences) in the observed dataset to be fixed. Previous null models were instead Row-Order Enforcing (ROE). We argue that the order of the rows is not meaningful for many KDD tasks on such datasets (e.g., frequent pattern mining, large tile identification), thus ROA null models more closely represent the settings of such tasks. Apart from this difference, ROA models can preserve the same properties (e.g., number of rows, lengths of the rows, item/itemset frequencies, ...) as ROE models.

  • We present ROhAN, a general algorithmic approach for the efficient sampling of datasets from ROA models. Our methods can use existing or future approaches for sampling from ROE models as subroutines (thus building on top of a vast literature), and rely on the Metropolis-Hastings (MH) algorithm when these are based on the Markov-Chain-Monte-Carlo (MCMC) method, and on rejection sampling otherwise. Our procedures can be used in resampling-based hypothesis testing for the validation of KDD results.

  • The results of our experimental evaluation show that ROE and ROA null models are not equivalent, and this difference affects the validation of results. We evaluated ROhAN on real datasets: it is fast, (empirically) rapidly-mixing, and scalable as the dataset grows.

1.1 Related work


Transactional and sequence datasets are a natural representation of data from many areas, from logs, to gene mutations, to temporal events, to athletes’ vitals (Hrovat et al. 2015), to satellite images (Méger et al. 2015). They are extremely common, and many KDD methods for them are available. We focus here on works related to the validation of results from such datasets.

1.1.1 Null models for transactional datasets


The need to evaluate the statistical significance of results obtained from transactional datasets has long been noted (Megiddo and Srikant 1998) and remarked by the KDD community (Zimmermann 2014). A long line of research studied how to discard non-interesting patterns from mined collections, or directly mine patterns w.r.t. different interestingness measures (Vreeken and Tatti 2014). This direction is orthogonal to assessing the statistical significance of the results, but they may be combined (Dalleiger and Vreeken 2022).

Many works focused on finding significant patterns, where the meaning of “significance” is varied. Hämäläinen and Webb (2019) and Pellegrina et al. (2019) survey this area, so we focus on the contributions most relevant to ours.

Gionis et al. (2007) study a ROE null model \((\mathcal {Z}_{\textrm{M}}, \pi )\) for transactional datasets, where \(\mathcal {Z}_{\textrm{M}}\) is the set of all \(I \times J\) binary matrices with the same row and column sums (a.k.a. margins) as the observed dataset. The problem of how to generate such matrices has been studied in mathematics (Ryser 1963, Ch. 6) (e.g., as the problem of generating bipartite graphs with fixed degree sequences) and statistics (e.g., to sample 2-way \(I \times J\) binary contingency tables) for a long time (Besag and Clifford 1989, Sect. 3), as it has applications to, e.g., ecology (Connor and Simberloff 1979). Gionis et al. (2007) use MCMC approaches to sample from \((\mathcal {Z}_{\textrm{M}}, \pi )\), to assess KDD results. We argue that the output of many KDD tasks (e.g., frequent itemset mining) from transactional datasets does not depend on the order of the transactions, and null models that do not consider this order fixed, i.e., the ROA models that we introduce, are more representative of the settings of such tasks. Our algorithmic framework ROhAN can use existing and future methods to sample from \((\mathcal {Z}_\textrm{M}, \pi )\) as subroutines to sample from ROA models, thus allowing us to build on top of an extensive literature, discussed in depth by Fout (2022). Recently, Preti et al. (2022) presented a ROA model for transactional datasets preserving the number of caterpillars in the graph corresponding to a transactional dataset; they leverage our Lemma 3.


De Bie (2010) proposes null models \((\mathcal {Z}_{\mathbb {E}}, \pi _{\text {MaxEnt}})\) that preserve properties of the observed transactional dataset in expectation w.r.t. \(\pi _{\text {MaxEnt}}\), rather than exactly, as do our ROA models and the ROE model studied by Gionis et al. (2007). The distribution \(\pi _{\text {MaxEnt}}\) over \(\mathcal {Z}_\mathbb {E}\) is the one with the maximum entropy among those satisfying the required expectations. These models are ROE in expectation, thus less appropriate, as argued, for many tasks on transactional datasets than the ROA models we propose. While requiring the distribution to have maximum entropy may be appropriate in some cases, a user-specified \(\pi\) can incorporate additional existing or assumed knowledge about the DGP in the null model. We therefore do not consider preserving properties in expectation in this work, but developing “in-expectation ROA models”, and efficient procedures to sample from them, is a possible direction for future work.

1.1.2 Null models for sequence datasets


Jenkins et al. (2022, Sect. 2) discuss previous work on assessing results from sequence datasets in depth, so here we only comment on the most relevant.

Tonon and Vandin (2019) introduce two null models for sequence datasets: one that preserves the number of sequences, the number of itemsets participating in each sequence (i.e., the length of the sequence), and the number of times an itemset participates in the sequences (i.e., the multi-support of the itemset), and a more restrictive null model preserving all the structure of the observed dataset except the order of the itemsets participating in each sequence. A more restrictive model is studied by Pinxteren and Calders (2021). All these models are ROE, as is the null model introduced by Jenkins et al. (2022, Sect. 4.2.2), which preserves the itemlengths of the sequences (i.e., the sums of the lengths of the itemsets participating in them), rather than the lengths. As in the case of transactional datasets, we argue that the order of the sequences in the dataset is not relevant for many KDD tasks, thus motivating our work on ROA models for sequence datasets. When sampling from ROA models that preserve the same properties as these ROE models, ROhAN employs the efficient methods by Jenkins et al. (2022) in combination with rejection sampling.


Gwadera and Crestani (2010) and Low-Kam et al. (2013) present maximum entropy ROE null models for sequence datasets. The comments above about the maximum entropy model by De Bie (2010) apply to these models as well.

1.1.3 Null models for other data


ROE null models have been proposed for database tables (Ojala et al. 2010), and real-valued and mixed-value matrices (Ojala et al. 2008; Ojala 2010). Developing ROA null models, and efficient algorithms to sample from them, is an interesting direction for future work.

2 Preliminaries

We first define the types of datasets we study, and then discuss the fundamentals of resampling-based statistical hypothesis testing.

2.1 Transactional and sequence datasets

Let \(\mathcal {I}\doteq \{a_1, \dotsc , a_{n}\}\) be a finite alphabet of \(n\doteq \left| {\mathcal {I}}\right|\) items. W.l.o.g., \(\mathcal {I}\doteq \{1, \dotsc , n\}\). An itemset \(A \subseteq \mathcal {I}\) is any non-empty subset of \(\mathcal {I}\).

A transactional dataset \(\mathcal {D}\doteq \{t_1, \dotsc , t_{m} \}\) is a finite bag of \(m\doteq \left| {\mathcal {D}}\right|\) itemsets, which, as elements of \(\mathcal {D}\), are known as transactions. An itemset A appears in transaction t when \(A \subseteq t\). The support \(\sigma _{\mathcal {D}}(A)\) of itemset A in the transactional dataset \(\mathcal {D}\) is the number of transactions of \(\mathcal {D}\) in which A appears, i.e.,

$$\begin{aligned} \sigma _{\mathcal {D}}(A) \doteq \left| {\{ t \in \mathcal {D}\mathrel {:}A \subseteq t \}}\right| . \end{aligned}$$

For example, if we let \(\mathcal {D}= \{\{1, 2, 4\}, \{2, 4\}\}\), then \(\sigma _{\mathcal {D}}(\{2, 4\}) = 2\) since the itemset \(\{2, 4\}\) appears in both transactions in \(\mathcal {D}\), while, e.g., \(\sigma _{\mathcal {D}}(\{1,2\}) = 1\).
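For illustration, the support can be computed with a single pass over the dataset. The following minimal Java sketch (a toy representation, with transactions as sets of integer items; not code from our implementation) does exactly that:

```java
import java.util.*;

final class SupportExample {
    // sigma_D(A): the number of transactions of D in which A appears.
    static long support(List<Set<Integer>> dataset, Set<Integer> itemset) {
        return dataset.stream().filter(t -> t.containsAll(itemset)).count();
    }

    public static void main(String[] args) {
        List<Set<Integer>> d = Arrays.asList(
            new HashSet<>(Arrays.asList(1, 2, 4)),
            new HashSet<>(Arrays.asList(2, 4)));
        System.out.println(support(d, new HashSet<>(Arrays.asList(2, 4)))); // 2
        System.out.println(support(d, new HashSet<>(Arrays.asList(1, 2)))); // 1
    }
}
```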

A sequence is a finite ordered list (or a vector) of not-necessarily-distinct itemsets, i.e., \(S = \langle {A_1, \dotsc , A_\ell }\rangle\) for some \(\ell \ge 1\), with \(A_i \subseteq \mathcal {I}\), \(1 \le i \le \ell\). The itemsets \(A_i\) participate in S, and we denote this fact with \(A_i \in S\), \(1 \le i \le \ell\). The length \(\left| {S}\right|\) of a sequence is the number of itemsets participating in it. The itemlength \(\Vert {S}\Vert \doteq \sum _{A_i \in S} \left| {A_i}\right|\) is the total number of items in S. A sequence \(S=\langle {A_1,\dotsc ,A_{\left| {S}\right| }}\rangle\) is a subsequence of a sequence \(T = \langle {B_1, \dotsc , B_{\left| {T}\right| }}\rangle\), denoted \(S \sqsubseteq T\), if there exist integers \(i_1< i_2< \cdots < i_{\left| {S}\right| }\) such that \(A_1 \subseteq B_{i_1}, A_2 \subseteq B_{i_2}, \dotsc , A_{\left| {S}\right| } \subseteq B_{i_{\left| {S}\right| }}\). Suppose that \(A = \{1, 2, 4\}\) and \(B = \{2, 4\}\), and let \(S = \langle {A, B, B}\rangle\). Then \(A, B \in S\), \(\left| {S}\right| = 3\), and \(\Vert {S}\Vert = 7\). In addition, if \(T = \langle {A, A, B, C, B}\rangle\), for any itemset C, then \(S \sqsubseteq T\): a possible choice of indices is \(i_1 = 1, i_2 = 2\), \(i_3 = 5\) (or \(i_3 = 3\)), as \(B \subset A\), but other choices are also possible.
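Deciding whether \(S \sqsubseteq T\) can be done with a single greedy scan of T that matches each itemset of S to the earliest remaining itemset of T containing it (greedy matching is safe here, as any valid embedding can only be shifted left). A minimal Java sketch, with the same toy representation as above:

```java
import java.util.*;

final class SubsequenceExample {
    // Decides S ⊑ T by greedily matching each itemset of S to the earliest
    // later itemset of T that contains it.
    static boolean isSubsequence(List<Set<Integer>> s, List<Set<Integer>> t) {
        int i = 0;
        for (Set<Integer> a : s) {
            while (i < t.size() && !t.get(i).containsAll(a)) i++;
            if (i == t.size()) return false;
            i++; // the matched indices must be strictly increasing
        }
        return true;
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 2, 4));
        Set<Integer> b = new HashSet<>(Arrays.asList(2, 4));
        Set<Integer> c = new HashSet<>(Arrays.asList(7)); // any itemset C
        List<Set<Integer>> s = Arrays.asList(a, b, b);
        List<Set<Integer>> t = Arrays.asList(a, a, b, c, b);
        System.out.println(isSubsequence(s, t)); // true
    }
}
```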

A sequence dataset \(\mathcal {D}\) is a finite bag of sequences, which, as elements of \(\mathcal {D}\), are known as seq-transactions. The support \(\sigma _{\mathcal {D}}(S)\) of a sequence S in \(\mathcal {D}\) is the number of seq-transactions in \(\mathcal {D}\) of which S is a subsequence. The support \(\sigma _{\mathcal {D}}(A)\) of an itemset A in \(\mathcal {D}\) is the number of seq-transactions of \(\mathcal {D}\) in which A participates. The multi-support \(\rho _{\mathcal {D}}(A)\) of A in \(\mathcal {D}\) is the number of times that A participates in total in the seq-transactions of \(\mathcal {D}\). For example, if \(\mathcal {D}= \{ \langle {A,B}\rangle , \langle {A,C,A}\rangle , \langle {B,C}\rangle \}\), then \(\sigma _{\mathcal {D}}(A) = 2\) and \(\rho _{\mathcal {D}}(A) = 3\).

In the rest of the work, we use the term “row” to refer to a transaction for transactional datasets, or to a sequence for sequence datasets. We also use the term pattern to refer to an itemset or a sequence respectively, and we denote with \(\mathcal {L}\) the set of all possible patterns. Doing this allows us to define the generic task of frequent pattern mining: given a minimum support threshold \(\theta \in [1, \left| {\mathcal {D}}\right| ]\), the set \(\textsf{FP}_{\mathcal {D}}(\theta )\) of frequent patterns in \(\mathcal {D}\) w.r.t. \(\theta\) is the set of patterns that have support at least \(\theta\) in \(\mathcal {D}\), i.e.,

$$\begin{aligned} \textsf{FP}_{\mathcal {D}}(\theta ) \doteq \left\{ P \in \mathcal {L}\mathrel {:}\sigma _{\mathcal {D}}(P) \ge \theta \right\} . \end{aligned}$$
(1)

Efficient algorithms for finding the frequent patterns exist for both transactional and sequence datasets (Agrawal and Srikant 1994; Pei et al. 2004).

We define transactional and sequence datasets as bags, so the rows in them have no fixed order. Later we discuss ROE models for which the order of the rows in a dataset is considered fixed. In this case, datasets are ordered lists or vectors of rows, and we refer to them as ordered datasets.

2.2 Null models and hypothesis testing

We tailor the presentation of hypothesis testing to the task of evaluating the significance of the size \(\left| {\textsf{FP}_{\mathcal {D}}(\theta )}\right|\) of the collection of frequent patterns. We choose this simple statistically-sound KDD task because it allows for a self-contained presentation that is also accessible to non-experts, rather than describing an arguably more interesting, but certainly more convoluted task such as mining statistically-significant frequent patterns. Both the ROE and the ROA models we discuss can be used to validate any kind of results obtained from transactional and sequence datasets, including mining statistically-significant frequent patterns, evaluating the correlations between different items, and more.

Statistical significance is assessed w.r.t. a user-specified null model, defined on the basis of an observed dataset \({\mathring{\mathcal {D}}}\), given by the user. A null model is a pair \(\Pi \doteq (\mathcal {Z}, \pi )\), where \(\mathcal {Z}\) is a set of datasets, known as the null set, and \(\pi\) is a user-specified probability distribution over \(\mathcal {Z}\). The null set \(\mathcal {Z}\) is such that \({\mathring{\mathcal {D}}}\in \mathcal {Z}\) and \(\mathcal {Z}\) contains all and only datasets that share some user-specified characteristic properties with \({\mathring{\mathcal {D}}}\), i.e., the null model depends on the observed dataset \({\mathring{\mathcal {D}}}\). For example, the user may want to preserve the number \(\left| {{\mathring{\mathcal {D}}}}\right|\) of rows, and/or the support of single items in \({\mathring{\mathcal {D}}}\), and much more. The user may specify any distribution \(\pi\) over \(\mathcal {Z}\). Choosing which properties of \({\mathring{\mathcal {D}}}\) to preserve, and which distribution to sample from, allows the user to incorporate in the null model existing or assumed knowledge about the DGP, as \(\mathcal {Z}\) is the set of all the datasets that the DGP may generate, and \(\pi\) is the distribution according to which the DGP generates datasets.

The null model is used to understand whether the observed results represent new knowledge about the DGP. Specifically, the goal is understanding how “typical” the results from \({\mathring{\mathcal {D}}}\) are w.r.t. the distribution of the results from datasets sampled from the null model \(\Pi\): if they are not “typical”, the results are considered significant (under \(\Pi\)), i.e., expressing new knowledge about the DGP. For example, if we want to assess whether the number \(\left| {\textsf{FP}_{{\mathring{\mathcal {D}}}}(\theta )}\right|\) of frequent patterns w.r.t. \(\theta\) in \({\mathring{\mathcal {D}}}\) is significant, we could make the null hypothesis

$$\begin{aligned} H_0 \doteq \text {``} \left| {\textsf{FP}_{{\mathring{\mathcal {D}}}}(\theta )}\right| = \mathbb {E}_{\mathcal {D}\sim \pi }[ \left| {\textsf{FP}_{\mathcal {D}}(\theta )}\right| ] \text {''}, \end{aligned}$$
(2)

and then perform a statistical hypothesis test to assess whether there is sufficient evidence that this null hypothesis may be false. If so, we reject the null hypothesis and say that the value \(\left| {\textsf{FP}_{{\mathring{\mathcal {D}}}}(\theta )}\right|\) appears significant.

One way to perform such a test is to approximate the distribution of the statistic of interest (in this case, the number of frequent patterns) by sampling datasets from the null model (Lehmann and Romano 2022, Ch. 17), and then compare the observed statistic \(\left| {\textsf{FP}_{{\mathring{\mathcal {D}}}}(\theta )}\right|\) to the obtained empirical distribution, as follows. Assume we sample a collection \(\mathcal {T} \doteq \{ \mathcal {D}_1, \dotsc , \mathcal {D}_\ell \}\) of \(\ell\) datasets independently from \(\mathcal {Z}\) according to \(\pi\). The (empirical) p-value \(\tilde{\textsf{p}}({\mathring{\mathcal {D}}}, \mathcal {T})\) is defined as the fraction of datasets in \(\mathcal {T} \cup \{{\mathring{\mathcal {D}}}\}\) with a number of frequent itemsets w.r.t. \(\theta\) that is not smaller than the one observed in \({\mathring{\mathcal {D}}}\), i.e.,

$$\begin{aligned} \tilde{\textsf{p}}({\mathring{\mathcal {D}}}, \mathcal {T}) \doteq \frac{1 + \left|{\{1 \le i \le \ell \mathrel {:}\left| {\textsf{FP}_{\mathcal {D}_i}(\theta )}\right| \ge \left|{\textsf{FP}_{{\mathring{\mathcal {D}}}}(\theta )}\right| \}}\right|}{1 + \ell } . \end{aligned}$$

Now let \(\alpha \in (0,1)\) be a user-specified acceptable probability of error. If \(\tilde{\textsf{p}}({\mathring{\mathcal {D}}}, \mathcal {T}) \le \alpha\), then we say that \(\left| {\textsf{FP}_{{\mathring{\mathcal {D}}}}(\theta )}\right|\) is significant at level \(\alpha\), which can be interpreted as meaning that there is evidence that the null hypothesis from (2) is false and should be rejected. The value \(\alpha\) bounds the probability of a false discovery, i.e., of wrongly declaring the observed results significant.
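A minimal Java sketch of this resampling test follows; sampler and statistic are generic placeholders for a null-model sampler and the statistic of interest (e.g., \(\left| {\textsf{FP}_{\mathcal {D}}(\theta )}\right|\)):

```java
import java.util.Random;
import java.util.function.Supplier;
import java.util.function.ToLongFunction;

final class EmpiricalPValueExample {
    // Empirical p-value: fraction of datasets in T ∪ {observed} whose
    // statistic is at least the observed one.
    static <D> double pValue(D observed, Supplier<D> sampler,
                             ToLongFunction<D> statistic, int ell) {
        long observedStat = statistic.applyAsLong(observed);
        int atLeast = 0;
        for (int i = 0; i < ell; i++) {
            if (statistic.applyAsLong(sampler.get()) >= observedStat) atLeast++;
        }
        return (1.0 + atLeast) / (1.0 + ell);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Toy usage: a "dataset" is an integer, and its statistic is itself.
        double p = pValue(9, () -> rng.nextInt(10), d -> d, 1000);
        System.out.println(p); // ~0.1: 9 sits in the top tenth of the null
    }
}
```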

In most statistically-sound KDD tasks, multiple hypotheses must be tested. For example, in significant pattern mining (Hämäläinen and Webb 2019; Pellegrina et al. 2019), there is one hypothesis per pattern. One then wants guarantees, e.g., on the Family-Wise Error Rate (FWER), i.e., on the probability of making any false discovery. To ensure that the FWER is bounded by a user-specified threshold \(\delta \in (0,1)\), the p-value of each hypothesis to be tested is compared to an adjusted critical value \(\alpha (\Pi , \mathcal {H}, \delta )\), where \(\mathcal {H}\) is the set of the null hypotheses of interest. Resampling approaches for multiple hypothesis testing (Westfall and Young 1993) compute adjusted critical values using datasets sampled according to \(\pi\), and they have been used with success in significant itemset mining (Pellegrina et al. 2019).

This discussion highlights how efficient procedures to draw datasets from \(\mathcal {Z}\) independently according to \(\pi\) are needed for assessing the statistical validity of results obtained from these datasets. Our algorithmic framework ROhAN achieves this goal for ROA models.

3 Row-order-enforcing null models

We now describe ROE null models, i.e., models that consider the order of the rows in a dataset to be fixed, so that permuting the rows results, in general, in a different dataset. We also briefly describe the algorithms to sample from them, using existing models as examples.

3.1 ROE models for transactional datasets

Gionis et al. (2007) define a ROE model \((\mathcal {Z}, \pi )\) where, given an observed ordered dataset \({\mathring{\mathcal {D}}}\), \(\left| {{\mathring{\mathcal {D}}}}\right| = m\), \(\mathcal {Z}\) contains all and only the ordered datasets such that:

  1. \(\left| {\mathcal {D}}\right| = \left| {{\mathring{\mathcal {D}}}}\right| = m\), i.e., \(\mathcal {D}\) has the same size (number \(m\) of transactions) as \({\mathring{\mathcal {D}}}\); and

  2. \(\sigma _{\mathcal {D}}(\{a\}) = \sigma _{{\mathring{\mathcal {D}}}}(\{a\})\), for every item \(a \in \mathcal {I}\), i.e., each item has the same support in \(\mathcal {D}\) and \({\mathring{\mathcal {D}}}\); and

  3. for \(i = 1, \dotsc , m\), \(\left| {\mathcal {D}[i]}\right| = \left| {{\mathring{\mathcal {D}}}[i]}\right|\), i.e., the transaction at index i of \(\mathcal {D}\) has the same length as the transaction at index i of \({\mathring{\mathcal {D}}}\).

The distribution \(\pi\) can be any distribution over \(\mathcal {Z}\). We call ROE models that maintain the three constraints above “Size, Item-Supports, and Length Preserving” (SISLP). All SISLP null models for a given \({\mathring{\mathcal {D}}}\) have the same null set \(\mathcal {Z}\), i.e., they differ only in \(\pi\). De Bie (2010) considers a null model where the SISLP constraints are preserved only in expectation.


The SISLP models are just one example of ROE models for transactional datasets. One can devise others that preserve additional properties of the observed dataset. We take the SISLP models as an example for the whole class, and most of what we say for them can be applied to other ROE models.

3.1.1 Binary matrices and sampling algorithms


ROE models for transactional datasets effectively equate ordered datasets to binary matrices: the (i, j) entry of the matrix \(M_\mathcal {D}\) corresponding to the ordered dataset \(\mathcal {D}= [t_1, \dotsc , t_{\left| {\mathcal {D}}\right| }]\) is 1 iff item \(j \in t_i\). Thus, the null set \(\mathcal {Z}\) of SISLP ROE models corresponds to the set \(\mathcal {M}\) of \(m\times n\) binary matrices with fixed row and column sums, which is a classical object of study in mathematics (Ryser 1963, Ch. 6) and statistics (Fout 2022). This identity is extremely convenient, as it allows reusing existing algorithms that sample from \(\mathcal {M}\) to sample from \(\mathcal {Z}\). Indeed, Gionis et al. (2007) describe, among others, an MCMC algorithm introduced by Besag and Clifford (1989, Sect. 3), but any algorithm to sample from \(\mathcal {M}\) can be used, and the literature is extensive (Fout 2022), including importance sampling algorithms (Chen et al. 2005) and recent MCMC algorithms (Wang 2020).

We now briefly describe one of the MCMC algorithms by Gionis et al. (2007). For ease of presentation, we assume here that \(\pi\) is the uniform distribution over \(\mathcal {Z}\), i.e., over \(\mathcal {M}\). In Sect. 4.3 we show how ROhAN can use this algorithm as a subroutine to sample from SISLP-like ROA models. The algorithm, which we call SwapRand (for “Swap Randomization”), runs a Markov chain as follows. The state space is \(\mathcal {M}\), and there is an edge from matrix \(M'\) to matrix \(M''\) if there are two row indices \(1 \le r_1, r_2 \le m\) and two column indices \(1 \le c_1, c_2 \le n\) such that \(M'(r_1, c_1) = 1\), \(M'(r_1, c_2) = 0\), \(M'(r_2,c_1) = 0\), \(M'(r_2,c_2)=1\), and \(M''\) can be obtained from \(M'\) by setting \(M'(r_1, c_1) = 0\), \(M'(r_1, c_2) = 1\), \(M'(r_2,c_1) = 1\), and \(M'(r_2,c_2)=0\), i.e., by performing a single swap. When running the Markov chain, the algorithm chooses a neighbor \(M''\) of the current state \(M'\) uniformly at random among the \(\textsf{nei}(M')\) neighbors of \(M'\), and moves to it with probability \(\min \{1, {\textsf{nei}(M')}/{\textsf{nei}(M'')}\}\), otherwise it stays in \(M'\) (i.e., it follows a self-loop). Gionis et al. (2007, Alg. 2, Thm. 4.3) give procedures to compute \(\textsf{nei}(M)\) for any matrix, and to draw a neighbor uniformly at random. It is easy to show that the stationary distribution of this Markov chain is uniform over \(\mathcal {M}\). Thus, the algorithm runs the chain for a sufficient number \(\tau\) of steps to ensure that the distribution of the current state is (approximately) the stationary one, and returns the state at time \(\tau\) as a sample. This algorithm is just one example of MCMC methods to sample from \(\mathcal {M}\), and ROhAN can use any such algorithm as a subroutine, as we show in Sect. 4.3.
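For concreteness, the following Java sketch implements one step of a simplified swap chain: it proposes a uniformly random pair of rows and columns and performs the swap only when the corresponding \(2 \times 2\) submatrix is a “checkerboard”. This is not Gionis et al.'s Alg. 2: the uniform proposal is symmetric, so this chain also has uniform stationary distribution, at the price of following many more self-loops than the neighbor-counting chain described above.

```java
import java.util.Random;

final class SwapStepExample {
    // One step of a simplified swap-randomization chain on a 0/1 matrix.
    static void step(int[][] m, Random rng) {
        int rows = m.length, cols = m[0].length;
        int r1 = rng.nextInt(rows), r2 = rng.nextInt(rows);
        int c1 = rng.nextInt(cols), c2 = rng.nextInt(cols);
        if (r1 == r2 || c1 == c2) return; // degenerate proposal: self-loop
        // Swap iff the 2x2 submatrix is a checkerboard (either orientation).
        if (m[r1][c1] == m[r2][c2] && m[r1][c2] == m[r2][c1]
                && m[r1][c1] != m[r1][c2]) {
            m[r1][c1] ^= 1; m[r1][c2] ^= 1; // flipping the four entries
            m[r2][c1] ^= 1; m[r2][c2] ^= 1; // preserves row and column sums
        }
    }
}
```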

3.2 ROE models for sequence datasets

Sequence data is richer than transactional data, which makes it possible to define many different null models for it, by preserving different properties of the observed dataset \({\mathring{\mathcal {D}}}\). Tonon and Vandin (2019), Pinxteren and Calders (2021), and Jenkins et al. (2022) give ROE models for sequence datasets, and we now describe two of them as examples, but what we say can be applied to others. The first null model \((\mathcal {Z}^{(1)}, \pi ^{(1)})\) is essentially a SISLP model adapted to sequence datasets. \(\mathcal {Z}^{(1)}\) is the set of all and only the ordered datasets \(\mathcal {D}\) such that

  1. \(\left| {\mathcal {D}}\right| = \left| {{\mathring{\mathcal {D}}}}\right| = m\), i.e., \(\mathcal {D}\) has the same size (number \(m\) of seq-transactions) as \({\mathring{\mathcal {D}}}\); and

  2. for every itemset A participating in at least one seq-transaction of \({\mathring{\mathcal {D}}}\), it holds \(\rho _{\mathcal {D}}(A) = \rho _{{\mathring{\mathcal {D}}}}(A)\), i.e., the multi-supports of itemsets participating in the seq-transactions are preserved; and

  3. for \(i = 1, \dotsc , m\), \(\left| {\mathcal {D}[i]}\right| = \left| {{\mathring{\mathcal {D}}}[i]}\right|\), i.e., the seq-transaction at index i of \(\mathcal {D}\) has the same length as the seq-transaction at index i of \({\mathring{\mathcal {D}}}\).

The second null model \((\mathcal {Z}^{(2)}, \pi ^{(2)})\) preserves the same properties as the first, plus the additional property that, for \(i = 1, \dotsc , m\), \(\Vert {\mathcal {D}[i]}\Vert = \Vert {{\mathring{\mathcal {D}}}[i]}\Vert\), i.e., the seq-transaction at index i of \(\mathcal {D}\) has the same itemlength as the seq-transaction at index i of \({\mathring{\mathcal {D}}}\).

Jenkins et al. (2022) give efficient, exact algorithms for sampling from these and other ROE models for sequence datasets when \(\pi\) is the uniform distribution. Tonon and Vandin (2019) give an MCMC algorithm (a variant of the one described for the SISLP model for transactional datasets in Sect. 3.1.1) for the first null model, which can be modified to handle non-uniform distributions, and a similar one can also be devised for the second null model.

4 Row-order-agnostic null models and ROhAN

Here we introduce ROA null models, which consider datasets as bags of rows, i.e., do not fix the order of the rows. We also describe ROhAN, our algorithmic framework for sampling from ROA null models.

4.1 ROA models for transactional datasets

In ROA models for transactional datasets, the 1:1 mapping between datasets and binary matrices is lost, since this equivalence only holds between ordered datasets and binary matrices. We argue that the loss of this elegant identity is completely offset by the advantage of having null models that are more representative of the settings of KDD tasks on these datasets. Consider, for example, the task of mining the frequent patterns \(\textsf{FP}_{\mathcal {D}}(\theta )\) from (1): the definition of this collection does not depend on the order of the transactions in the dataset, and algorithms for finding this collection (e.g., A-Priori, FP-Growth, Eclat) do not rely on the order of the transactions being fixed or being anything but an arbitrary order that the algorithm can choose itself. In general, whenever the KDD task to be performed is insensitive to the order of the rows in the dataset, in the sense that the output of the task is the same for any permutation of the rows, a ROA model is likely more appropriate than a ROE one. The latter could instead be a better choice when the task output includes, even in a potentially implicit way, the identifiers of the rows. The difference can, at times, be subtle: consider for example the task of finding cluster centers for the rows (i.e., finding points in a space) and evaluating the significance of these centers, versus the task of finding a clustering of the rows (i.e., finding a partitioning of the rows) and evaluating the significance of such clustering or, e.g., the significance of groups of rows being in the same cluster. In the first case, a ROA model seems more appropriate than a ROE model. In the second case, it is necessary to know which rows belong to which cluster in order to perform the statistical validation, and to analyze how the clusters, which are subsets of rows, differ across the datasets in the null set, thus making a ROE model more appropriate. We stress again that the choice of the null model is crucial, and the user needs to exercise extreme care in this regard: it is hard to give generic advice about which of a ROE and a ROA model is to be preferred.

Properties of the observed dataset \({\mathring{\mathcal {D}}}\) that can be preserved by ROE models can also be preserved, with minor modifications in some cases, by ROA models. As an example, we define a ROA SISLP model \((\mathcal {Z}, \pi )\) for \({\mathring{\mathcal {D}}}\), where \(\mathcal {Z}\) contains all and only the unordered datasets such that:

  1. \(\left| {\mathcal {D}}\right| = \left| {{\mathring{\mathcal {D}}}}\right| = m\), i.e., \(\mathcal {D}\) has the same size (number \(m\) of transactions) as \({\mathring{\mathcal {D}}}\); and

  2. \(\sigma _{\mathcal {D}}(\{a\}) = \sigma _{{\mathring{\mathcal {D}}}}(\{a\})\), for every item \(a \in \mathcal {I}\), i.e., each item has the same support in \(\mathcal {D}\) and \({\mathring{\mathcal {D}}}\); and

  3. if we let \(\mathcal {D}=\{t_1, \dotsc , t_m\}\) and \({\mathring{\mathcal {D}}}= \{\mathring{t}_1, \dotsc , \mathring{t}_m\}\), there is a 1:1 mapping \(\phi\) from \({\mathring{\mathcal {D}}}\) to \(\mathcal {D}\) such that \(\left| {\phi (t)}\right| = \left| {t}\right|\) for every transaction \(t \in {\mathring{\mathcal {D}}}\), i.e., \(\mathcal {D}\) has the same distribution of transaction lengths as \({\mathring{\mathcal {D}}}\).

The first two properties are the same as the first two in the ROE SISLP model from Sect. 3.1, and the third is a straightforward adaptation of the third one. The distribution \(\pi\) can be any distribution over \(\mathcal {Z}\). In Sect. 4.3 we show how to use ROhAN to sample from this model.

We now comment on the differences between ROE and ROA SISLP models. Let \({\mathring{\mathcal {D}}}\) be an observed dataset, and let \(\textsf{ord}({\mathring{\mathcal {D}}})\) be an ordered dataset obtained by fixing an arbitrary order of the transactions of \({\mathring{\mathcal {D}}}\). Consider the null set \(\mathcal {Z}_{\textrm{A}}\) of a ROA SISLP model for \({\mathring{\mathcal {D}}}\) and the null set \(\mathcal {Z}_{\textrm{E}}\) of a ROE SISLP model for \(\textsf{ord}({\mathring{\mathcal {D}}})\). There is a surjective function \(\textsf{un}()\) from \(\mathcal {Z}_{\textrm{E}}\) to \(\mathcal {Z}_{\textrm{A}}\) which maps an ordered dataset to the corresponding unordered one (e.g., \(\textsf{un}(\textsf{ord}({\mathring{\mathcal {D}}})) = {\mathring{\mathcal {D}}}\)). For any \(\mathcal {D}\in \mathcal {Z}_{\textrm{A}}\) let \(\textsf{c}(\mathcal {D})\) be the number of ordered datasets in \(\mathcal {Z}_{\textrm{E}}\) that \(\textsf{un}()\) maps to \(\mathcal {D}\) (it holds \(\textsf{c}(\mathcal {D}) \ge 1\)). The following lemma shows that the ROE SISLP model \((\mathcal {Z}_{\textrm{E}}, \pi )\) and the ROA SISLP model \((\mathcal {Z}_{\textrm{A}}, \pi )\) are not equivalent, in the sense that one cannot sample an ordered dataset \(\mathcal {D}\) from \(\mathcal {Z}_{\textrm{E}}\) w.r.t. \(\pi\), and consider the unordered dataset \(\textsf{un}(\mathcal {D})\) as a sample from \(\mathcal {Z}_{\textrm{A}}\) w.r.t. \(\pi\).

Lemma 1

There exists an observed dataset \({\mathring{\mathcal {D}}}\) such that, if we let \(\mathcal {D}\) be an ordered dataset drawn uniformly at random from \(\mathcal {Z}_{\textrm{E}}\), then \(\textsf{un}(\mathcal {D})\) is not chosen uniformly at random from \(\mathcal {Z}_{\textrm{A}}\).

Proof

Let \({\mathring{\mathcal {D}}}= \{ \{1, 2\}, \{1, 3\}, \{3\} \}\), and assume, w.l.o.g., that \(\textsf{ord}({\mathring{\mathcal {D}}}) = [ \{1, 2\}, \{1, 3\}, \{3\} ]\), to which corresponds the binary matrix

$$\begin{aligned} M = \left[ \begin{array}{ccc} 1 &{} 1 &{} 0 \\ 1 &{} 0 &{} 1 \\ 0 &{} 0 &{} 1 \end{array} \right] . \end{aligned}$$

The matrix

$$\begin{aligned} M' = \left[ \begin{array}{ccc} 1 &{} 0 &{} 1 \\ 1 &{} 1 &{} 0 \\ 0 &{} 0 &{} 1 \end{array} \right] \end{aligned}$$

can be obtained from M with a single swap, and it corresponds to the ordered dataset \(\mathcal {D}' = [\{1, 3\}, \{1, 2\}, \{3\}]\), which, being ordered, is different from \(\textsf{ord}({\mathring{\mathcal {D}}})\), but it holds \(\textsf{un}(\mathcal {D}') = {\mathring{\mathcal {D}}}= \textsf{un}(\textsf{ord}({\mathring{\mathcal {D}}}))\). \(\mathcal {Z}_{\textrm{E}}\) thus contains at least two ordered datasets corresponding to the unordered dataset \({\mathring{\mathcal {D}}}\). From the definition of \(\mathcal {Z}_{\textrm{E}}\), it holds that it also contains the ordered dataset \(\mathcal {D}'' = [\{1,3\}, \{1, 3\}, \{2\}]\), with \(\textsf{un}(\mathcal {D}'') = \{ \{1,3\}, \{1, 3\}, \{2\} \}\). It is easy to see that there is no other ordered dataset \(\mathcal {D}''' \in \mathcal {Z}_{\textrm{E}}\) such that \(\textsf{un}(\mathcal {D}''') = \textsf{un}(\mathcal {D}'')\). Thus, if we sample an ordered dataset \(\mathcal {D}\) uniformly at random from \(\mathcal {Z}_{\textrm{E}}\), then there is a higher probability that \(\textsf{un}(\mathcal {D}) = {\mathring{\mathcal {D}}}\) than \(\textsf{un}(\mathcal {D}) = \textsf{un}(\mathcal {D}'')\), and our proof is complete. \(\square\)
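Anticipating the counting formula of Lemma 3, the two multiplicities in this proof can be computed explicitly: \({\mathring{\mathcal {D}}}\) has two distinct transactions of length 2 and one of length 1, while \(\textsf{un}(\mathcal {D}'')\) has two identical transactions of length 2, so

$$\begin{aligned} \textsf{c}({\mathring{\mathcal {D}}}) = \frac{2!}{1!\,1!} \cdot \frac{1!}{1!} = 2, \qquad \textsf{c}(\textsf{un}(\mathcal {D}'')) = \frac{2!}{2!} \cdot \frac{1!}{1!} = 1, \end{aligned}$$

i.e., under the uniform distribution over \(\mathcal {Z}_{\textrm{E}}\), the event \(\textsf{un}(\mathcal {D}) = {\mathring{\mathcal {D}}}\) has probability \(2/\left| {\mathcal {Z}_{\textrm{E}}}\right|\), twice that of \(\textsf{un}(\mathcal {D}) = \textsf{un}(\mathcal {D}'')\).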

Our algorithmic framework ROhAN (Sect. 4.3) returns samples from \(\mathcal {Z}_{\textrm{A}}\) according to \(\pi _{\textrm{A}}\).

4.2 ROA models for sequence datasets

The reasons for considering ROA models for sequence datasets are similar to those we discussed for transactional datasets, i.e., the order of the seq-transactions is not relevant for many KDD tasks on such data. Results similar to Lemma 1 can be obtained for sequence datasets.

The ROE models from Sect. 3.2 can be “converted” into ROA models in a way similar to what we discussed above for SISLP models for transactional datasets. This “conversion” has a deep consequence: the correctness of the exact sampling algorithms by Jenkins et al. (2022) for these null models depends on their ROE nature, so they cannot be easily adapted to the ROA models. For example, the algorithm for the first null model considers the observed sequence dataset as a single long vector of itemsets, and samples from the null model by applying to this vector a permutation chosen uniformly at random using the Fisher-Yates algorithm (see the sketch below). The key ingredient for correctness is that the number of permutations resulting in an ordered dataset \(\mathcal {D}\in \mathcal {Z}\) is the same constant for all such datasets. This property is lost in ROA models, thus new algorithms are needed. In Sect. 4.3 we show that ROhAN is able to build on top of efficient algorithms for ROE models, such as those by Jenkins et al. (2022).
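A minimal Java sketch of the permutation-based idea (a toy re-implementation, not Jenkins et al.'s code) makes the ROE nature explicit: the output is an ordered dataset, and every ordered dataset in \(\mathcal {Z}^{(1)}\) is produced by the same number of permutations.

```java
import java.util.*;

final class ShuffleSamplerExample {
    // ROE sampler for the first sequence null model: view the dataset as one
    // long vector of itemsets, shuffle it uniformly at random (Fisher-Yates),
    // and cut it back into seq-transactions of the original lengths.
    static List<List<Set<Integer>>> sample(List<List<Set<Integer>>> dataset,
                                           Random rng) {
        List<Set<Integer>> pool = new ArrayList<>();
        for (List<Set<Integer>> seq : dataset) pool.addAll(seq);
        Collections.shuffle(pool, rng); // uniform random permutation
        List<List<Set<Integer>>> out = new ArrayList<>();
        int pos = 0;
        for (List<Set<Integer>> seq : dataset) { // original lengths
            out.add(new ArrayList<>(pool.subList(pos, pos + seq.size())));
            pos += seq.size();
        }
        return out;
    }
}
```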

4.3 ROhAN: sampling from ROA models

We now describe ROhAN, our algorithmic framework for sampling from ROA models. ROhAN uses, as subroutines, algorithms to sample from ROE models, thus allowing us not only to build on the extensive library of such methods, but also to show that it will be possible to adapt to ROA models any algorithm that may be developed in the future for (possibly not-yet-defined) ROE models.

4.3.1 ROhAN-m: using MCMC algorithms for ROE models

We first show ROhAN-m, which essentially “converts” an MCMC algorithm \(\mathcal {A}_{\textrm{E}}\) for a ROE model \((\mathcal {Z}_{\textrm{E}}, \pi _{\textrm{E}})\) into an MCMC algorithm \(\mathcal {A}_{\textrm{A}}\) for a ROA model \((\mathcal {Z}_{\textrm{A}}, \pi _{\textrm{A}})\) that preserves the same properties, up to the distinction between the sequence of row lengths and the distribution of row lengths, as in the ROE vs. ROA SISLP models from Sects. 3.1 and 4.1 respectively, or similarly for the ROA versions of the null models for sequence datasets from Sect. 3.2. We impose no assumption on the distributions \(\pi _{\textrm{E}}\) and \(\pi _{\textrm{A}}\) nor on their relationship (e.g., they do not need to both be uniform).

The intuition behind ROhAN-m is that, given a Markov chain on \(\mathcal {Z}_{\textrm{E}}\) with stationary distribution \(\pi _{\textrm{E}}\), we can use the Metropolis-Hastings (MH) approach (Mitzenmacher and Upfal 2005, Ch. 10) to convert it into a Markov chain still defined on \(\mathcal {Z}_{\textrm{E}}\) but with stationary distribution \(\zeta = \zeta (\pi _{\textrm{A}})\) so that, if we sample an ordered dataset \(\mathcal {D}\) from \(\mathcal {Z}_{\textrm{E}}\) w.r.t. \(\zeta\), then \(\textsf{un}(\mathcal {D})\) is a sample from \(\mathcal {Z}_{\textrm{A}}\) w.r.t. \(\pi _{\textrm{A}}\). We later derive the appropriate \(\zeta\) to use.

ROhAN-m uses \(\mathcal {A}_{\textrm{E}}\) as a subroutine as follows. Let \(\mathcal {D}\) be the ordered dataset that is the current state of the Markov chain on \(\mathcal {Z}_{\textrm{E}}\) used by algorithm \(\mathcal {A}_{\textrm{E}}\), and let \(\mathcal {D}'\) be an ordered dataset obtained by simulating a step of the Markov chain of \(\mathcal {A}_{\textrm{E}}\) and \(\eta _{\mathcal {D}}(\mathcal {D}')\) be the transition probability from \(\mathcal {D}\) to \(\mathcal {D}'\). The chain used by ROhAN-m will then move to \(\mathcal {D}'\) with probability

$$\begin{aligned} \min \left( \frac{\zeta (\mathcal {D}') \eta _{\mathcal {D}'}(\mathcal {D})}{\zeta (\mathcal {D}) \eta _{\mathcal {D}}(\mathcal {D}')}, 1 \right) , \end{aligned}$$
(3)

and otherwise stays in \(\mathcal {D}\) (i.e., follows a self-loop). The resulting Markov chain has stationary distribution \(\zeta\) (Mitzenmacher and Upfal 2005, Ex. 10.12). ROhAN-m runs this Markov chain starting from \(\textsf{ord}({\mathring{\mathcal {D}}})\). Once the chain has mixed, the algorithm returns \(\textsf{un}(\mathcal {D})\), where \(\mathcal {D}\) is the ordered dataset corresponding to the final state of the chain. We remark that the Markov chain run by ROhAN-m is still defined on \(\mathcal {Z}_{\textrm{E}}\), not on \(\mathcal {Z}_{\textrm{A}}\).
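In code, one step of ROhAN-m can be sketched as follows (Java; RoeChain is a hypothetical abstraction of the subroutine \(\mathcal {A}_{\textrm{E}}\), exposing its proposal mechanism and its transition probabilities \(\eta\); \(\zeta\) is the target distribution derived in (4) below):

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

final class RohanMStepExample {
    // Hypothetical interface for the ROE MCMC subroutine A_E.
    interface RoeChain<D> {
        D propose(D current, Random rng); // simulate one step of A_E's chain
        double eta(D from, D to);         // transition probability
    }

    // One MH step of ROhAN-m's chain on Z_E, with acceptance as in (3).
    static <D> D step(D current, RoeChain<D> chain,
                      ToDoubleFunction<D> zeta, Random rng) {
        D next = chain.propose(current, rng);
        double acc = Math.min(1.0,
            (zeta.applyAsDouble(next) * chain.eta(next, current))
          / (zeta.applyAsDouble(current) * chain.eta(current, next)));
        return (rng.nextDouble() <= acc) ? next : current; // else self-loop
    }
}
```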

We now move to derive \(\zeta\), and then show the correctness of ROhAN-m. The intuition is that the desired probability \(\pi _{\textrm{A}}\) to sample \(\mathcal {D}\) from \(\mathcal {Z}_{\textrm{A}}\) should be “spread” among the \(\textsf{c}(\mathcal {D})\) ordered datasets in \(\mathcal {Z}_{\textrm{E}}\) that \(\textsf{un}()\) maps to \(\mathcal {D}\). The stationary distribution used by ROhAN-m is then

$$\begin{aligned} \zeta (\mathcal {D}) \doteq \frac{\pi _{\textrm{A}}(\textsf{un}(\mathcal {D}))}{\textsf{c}(\textsf{un}(\mathcal {D}))},\ \text {for}\ \mathcal {D}\in \mathcal {Z}_{\textrm{E}}. \end{aligned}$$
(4)

Theorem 2

ROhAN-m outputs a sample from \(\mathcal {Z}_{\textrm{A}}\) with distribution \(\pi _{\textrm{A}}\).

Proof

An unordered dataset \(\mathcal {D}' \in \mathcal {Z}_{\textrm{A}}\) is output by ROhAN-m iff the algorithm samples an ordered dataset \(\mathcal {D}\) such that \(\textsf{un}(\mathcal {D}) = \mathcal {D}'\). There are \(\textsf{c}(\mathcal {D}')\) such ordered datasets in \(\mathcal {Z}_{\textrm{E}}\), each sampled with probability \(\zeta (\mathcal {D})\) as in (4). Thus, the probability of returning \(\mathcal {D}'\) is exactly \(\pi _{\textrm{A}}(\mathcal {D}')\). \(\square\)

The only missing ingredient is an expression for \(\textsf{c}(\mathcal {D})\), which depends on the type of data (sequence vs. transactional) and on the null model, but not on the fact that we are considering MCMC algorithms in this section: the same expressions can also be used with rejection sampling, as we discuss in Sect. 4.3.2. For transactional datasets, we give an expression valid for essentially any null model, under a weak general assumption. For sequence datasets, the richer nature of the data, and therefore of the null models, makes deriving such a generic expression impossible, so we derive it for the two null models from Sect. 3.2. Obtaining such an expression is the only additional step needed to use ROhAN-m for other null models.


\(\textsf{c}(\mathcal {D})\) for transactional datasets

We now discuss the computation of \(\textsf{c}(\mathcal {D})\) for transactional datasets. The following result gives an expression for this quantity. It is valid as long as the ROE null set \(\mathcal {Z}_{\textrm{E}}\) contains all possible ordered datasets corresponding to an unordered dataset \(\mathcal {D}\in \mathcal {Z}_{\textrm{A}}\), which is a very weak assumption: if that were not the case, it would mean that preserving the order of the transactions is important, i.e., a ROE model would be appropriate, and a corresponding ROA model likely would not be. This result has recently been used by Preti et al. (2022) for the same purpose.

Lemma 3

For any dataset \(\mathcal {D}\in \mathcal {Z}_{\textrm{A}}\), let \(z_\mathcal {D}\) be the maximum length of any transaction in \(\mathcal {D}\). For each \(1 \le i \le z_\mathcal {D}\), let \(T_i\) be the bag of transactions of length i in \(\mathcal {D}\). Let \({\bar{T}}_i = \{\tau _{i,1}, \dotsc , \tau _{i,h_i}\}\) be the set of transactions of length i in \(\mathcal {D}\), i.e., without duplicates. For each \(1 \le j \le h_i\), let \(W_{i,j} \doteq \{ t' \in T_i \mathrel {:}t' = \tau _{i,j} \}\) be the bag of transactions in \(T_i\) (including \(\tau _{i,j}\)) identical to \(\tau _{i,j} \in {\bar{T}}_i\). Then, the number of ordered datasets in \(\mathcal {Z}_{\textrm{E}}\) that are mapped to \(\mathcal {D}\) by \(\textsf{un}()\) is

$$\begin{aligned} \textsf{c}(\mathcal {D}) = \prod _{i=1}^{z_\mathcal {D}} \left( {\begin{array}{c}\left| {T_i}\right| \\ \left| {W_{i,1}}\right| ,\dotsc ,\left| {W_{i,h_i}}\right| \end{array}}\right) = \prod _{i=1}^{z_\mathcal {D}} \frac{\left| {T_i}\right| !}{\prod _{j=1}^{h_i} \left| {W_{i,j}}\right| !} . \end{aligned}$$
(5)

Proof

Recall that \(\mathcal {Z}_{\textrm{E}}\) depends on the observed dataset \({\mathring{\mathcal {D}}}\) and on the arbitrary ordering of its transactions in \(\textsf{ord}({\mathring{\mathcal {D}}})\), as the ordering fixes the row-sums \(r_{x}\), \(1 \le x \le m\). In other words, it fixes the row indices of rows corresponding to transactions of length i, \(1 \le i \le z_\mathcal {D}\), of \(\mathcal {D}\). Thus, the number of different ways in which the transactions of \(\mathcal {D}\) can be assigned as the transactions of an ordered dataset in \(\mathcal {Z}_{\textrm{E}}\) is the product, over the transaction lengths, of the number \(q_i\) of different ways in which the transactions in \(T_i\) can be assigned, i.e.,

$$\begin{aligned} \textsf{c}(\mathcal {D}) = \prod _{i=1}^{z_\mathcal {D}} q_i . \end{aligned}$$

Thus, we only have to argue that

$$\begin{aligned} q_i = \left( {\begin{array}{c}\left| {T_i}\right| \\ \left| {W_{i,1}}\right| ,\dotsc ,\left| {W_{i,h_i}}\right| \end{array}}\right) , \end{aligned}$$

which is true because the multinomial coefficient \(\left( {\begin{array}{c}n\\ k_1,\dotsc ,k_h\end{array}}\right)\) is the number of different permutations of a bag containing n objects such that \(k_1\) objects are indistinguishable among themselves and of type 1, \(k_2\) objects are indistinguishable among themselves and of type 2, and so on (Stanley 2011, Eq. 1.22). \(\square\)
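Since \(\textsf{c}(\mathcal {D})\) grows astronomically even for small datasets, it is convenient to work with \(\ln \textsf{c}(\mathcal {D})\) (as we do in Sect. 5.2). A minimal Java sketch, computing (5) directly in log-space with the same toy representation as before:

```java
import java.util.*;

final class LogMultiplicityExample {
    // ln c(D) via Lemma 3: add ln|T_i|! for each transaction length i and
    // subtract ln|W_{i,j}|! for each class of identical transactions.
    static double lnC(List<Set<Integer>> dataset) {
        Map<Integer, Integer> byLength = new HashMap<>();           // |T_i|
        Map<Set<Integer>, Integer> byTransaction = new HashMap<>(); // |W_{i,j}|
        for (Set<Integer> t : dataset) {
            byLength.merge(t.size(), 1, Integer::sum);
            byTransaction.merge(t, 1, Integer::sum);
        }
        double lnc = 0.0;
        for (int size : byLength.values()) lnc += lnFactorial(size);
        for (int mult : byTransaction.values()) lnc -= lnFactorial(mult);
        return lnc;
    }

    static double lnFactorial(int n) {
        double s = 0.0;
        for (int k = 2; k <= n; k++) s += Math.log(k);
        return s;
    }
}
```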

Assume now that ROhAN-m is in state \(\mathcal {D}\), and that \(\mathcal {D}'\) is the proposed state, which is a neighbor of \(\mathcal {D}\). The only use of \(\textsf{c}(\textsf{un}(\mathcal {D}))\) and \(\textsf{c}(\textsf{un}(\mathcal {D}'))\) by ROhAN-m is in the computation of the acceptance probability from (3), as \(\textsf{c}(\textsf{un}(\cdot ))\) appears in the definition of \(\zeta\) from (4). Plugging the r.h.s. of (4) into (3), we obtain

$$\begin{aligned} \min \left( \frac{\textsf{c}(\textsf{un}(\mathcal {D}))}{\textsf{c}(\textsf{un}(\mathcal {D}'))} \frac{\pi _{\textrm{A}}(\textsf{un}(\mathcal {D}'))}{\pi _{\textrm{A}}(\textsf{un}(\mathcal {D}))} \frac{\eta _{\mathcal {D}'}(\mathcal {D})}{\eta _{\mathcal {D}}(\mathcal {D}')}, 1 \right) . \end{aligned}$$

The distribution \(\pi _{\textrm{A}}\) is given in input, and both \(\eta _{\mathcal {D}'}(\mathcal {D})\) and \(\eta _{\mathcal {D}}(\mathcal {D}')\) can be obtained from the \(\mathcal {A}_{\textrm{E}}\) MCMC algorithm used to simulate a step of the underlying Markov chain, so we only need to discuss the computation of the ratio \({\textsf{c}(\textsf{un}(\mathcal {D}))}/{\textsf{c}(\textsf{un}(\mathcal {D}'))}\). We now show that obtaining this ratio can be done without having access to either quantity, not even for the first state \(\mathcal {D}= \textsf{ord}({\mathring{\mathcal {D}}})\).

Using the notation from the statement of Lemma 3, given a transaction \(t \in \textsf{un}(\mathcal {D})\), suppose \(t \in T_i\) for some length i, \(1 \le i \le z_{\textsf{un}(\mathcal {D})}\), and that \(t = \tau _{i, j} \in {\bar{T}}_i\), where \(1 \le j \le h_i\). Let \(\textsf{net}\) be a dictionary that maps each different transaction \(t \in \textsf{un}(\mathcal {D})\) to \(\left| {W_{i,j}}\right|\), i.e., the size of the bag of transactions equal to t (including t). This data structure is easy to initialize at the start of ROhAN-m and to keep up to date as the chain evolves. We can then obtain \({\textsf{c}(\textsf{un}(\mathcal {D}))}/{\textsf{c}(\textsf{un}(\mathcal {D}'))}\) as shown in Alg. 1, which leverages the fact that \(\textsf{c}(\textsf{un}(\mathcal {D})) = \textsf{c}(\textsf{un}(\mathcal {D}'))\) if \(\textsf{un}(\mathcal {D}) = \textsf{un}(\mathcal {D}')\) (line 1), and the definition of the multinomial coefficient, to greatly simplify the computation (lines 4–7).

[Algorithm 1: computing the ratio \({\textsf{c}(\textsf{un}(\mathcal {D}))}/{\textsf{c}(\textsf{un}(\mathcal {D}'))}\)]
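The pseudocode of Alg. 1 is not reproduced here; the following Java sketch conveys the idea behind it, under the simplifying assumption that the step replaces transactions t1 and t2 of \(\textsf{un}(\mathcal {D})\) with t1p and t2p of the same lengths (so the \(\left| {T_i}\right| !\) factors cancel), and that the four transactions are pairwise distinct; the remaining corner cases are handled analogously.

```java
import java.util.Map;
import java.util.Set;

final class MultiplicityRatioExample {
    // Ratio c(un(D)) / c(un(D')) when D' replaces t1, t2 with t1p, t2p.
    // `net` maps each distinct transaction of un(D) to its multiplicity.
    static double ratio(Map<Set<Integer>, Integer> net,
                        Set<Integer> t1, Set<Integer> t2,
                        Set<Integer> t1p, Set<Integer> t2p) {
        if (t1.equals(t1p) && t2.equals(t2p)) return 1.0; // un(D) = un(D')
        // Only the multiplicities of the four affected transactions change:
        // c(un(D))/c(un(D')) = (net[t1p]+1)(net[t2p]+1) / (net[t1] net[t2]).
        double num = (net.getOrDefault(t1p, 0) + 1.0)
                   * (net.getOrDefault(t2p, 0) + 1.0);
        double den = (double) net.get(t1) * net.get(t2);
        return num / den;
    }
}
```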

\(\textsf{c}(\mathcal {D})\) for sequence datasets


We now show two results on \(\textsf{c}(\mathcal {D})\) for the two null models for sequence datasets from Sect. 3.2: Lemma 4 for the first null model, and Lemma 5 for the second. Algorithms similar to Alg. 1 can be devised for these cases. The ideas presented here should also be useful to derive similar expressions for other null models (Tonon and Vandin 2019; Pinxteren and Calders 2021; Jenkins et al. 2022).

Lemma 4

For any sequence dataset \(\mathcal {D}\in \mathcal {Z}_{\textrm{A}}\), let \(z_\mathcal {D}\), \(T_i\), \(\bar{T_i}\), and \(W_{i,j}\) be defined as in Lemma 3 (with “seq-transaction” in place of “transaction”). Then,

$$\begin{aligned} \textsf{c}(\mathcal {D}) = \prod _{i=1}^{z_\mathcal {D}} \left( {\begin{array}{c}\left| {T_i}\right| \\ \left| {W_{i,1}}\right| ,\dotsc ,\left| {W_{i,h_i}}\right| \end{array}}\right) = \prod _{i=1}^{z_\mathcal {D}} \frac{\left| {T_i}\right| !}{\prod _{j=1}^{h_i} \left| {W_{i,j}}\right| !} . \end{aligned}$$
(6)

The fact that the expression is the same as the one in (5) should not be surprising, as the first null model is essentially a SISLP null model for sequence datasets. The proof is the same as that of Lemma 3, so we do not repeat it.

For the second null model, the following result holds.

Lemma 5

For any dataset \(\mathcal {D}\in \mathcal {Z}_{\textrm{A}}\), let \(z_\mathcal {D}\) be as in Lemma 4, and let \(y_\mathcal {D}\) be the maximum itemlength of any seq-transaction in \(\mathcal {D}\). For each \(1 \le i \le z_\mathcal {D}\) and \(1 \le j \le y_\mathcal {D}\), let \(T_{i,j}\) be the bag of seq-transactions of length i and itemlength j in \(\mathcal {D}\). Let \({\bar{T}}_{i,j} = \{\tau _{i,j,1}, \dotsc , \tau _{i,j,h_{i,j}}\}\) be the set of seq-transactions of length i and itemlength j in \(\mathcal {D}\), i.e., without duplicates. For each \(1 \le k \le h_{i,j}\), let \(W_{i,j,k} \doteq \{ t' \in T_{i,j} \mathrel {:}t' = \tau _{i,j,k} \}\) be the bag of seq-transactions in \(T_{i,j}\) (including \(\tau _{i,j,k}\)) identical to \(\tau _{i,j,k} \in {\bar{T}}_{i,j}\). Then,

$$\begin{aligned} \textsf{c}(\mathcal {D}) = \prod _{i=1}^{z_\mathcal {D}} \prod _{j=1}^{y_\mathcal {D}} \frac{\left| {T_{i,j}}\right| !}{\prod _{k=1}^{h_{i,j}} \left| {W_{i,j,k}}\right| !} . \end{aligned}$$

The proof is similar to those of Lemmas 3 and 4, adapted to the fact that we are considering sets/bags of seq-transactions that depend on both the length and the itemlength.

4.3.2 ROhAN-r: using rejection sampling

Not all algorithms for sampling from a ROE null model \((\mathcal {Z}_{\textrm{E}},\pi _{\textrm{E}})\) are based on MCMC. E.g., Jenkins et al. (2022) show non-MCMC algorithms to sample from the first and the second null models for sequence datasets from Sect. 3.2 when \(\pi _{\textrm{E}}\) is uniform. We now describe ROhAN-r, which uses rejection sampling (Casella et al. 2004) and such an algorithm \(\mathcal {A}\), to sample from a ROA null model \((\mathcal {Z}_{\textrm{A}},\pi _{\textrm{A}})\) that preserves the same properties of the observed dataset as \((\mathcal {Z}_{\textrm{E}},\pi _{\textrm{E}})\), up to the difference between preserving the sequence of row lengths vs. the distribution of row lengths. \(\mathcal {A}\) could even be an MCMC algorithm, but we saw in Sect. 4.3.1 how to directly “upcycle” such methods with ROhAN-m.

For any unordered dataset \(\mathcal {D}\in \mathcal {Z}_{\textrm{A}}\), let

$$\begin{aligned} \rho (\mathcal {D}) \doteq \sum _{\begin{array}{c} \mathcal {D}' \in \mathcal {Z}_{\textrm{E}}\ \text {s.t.}\\ \textsf{un}(\mathcal {D}') = \mathcal {D} \end{array}} \pi _{\textrm{E}}(\mathcal {D}') \end{aligned}$$
(7)

be the probability that \(\mathcal {A}\) returns an ordered dataset \(\mathcal {D}'\) such that \(\textsf{un}(\mathcal {D}') = \mathcal {D}\). Let \(Q \in \mathbb {R}\) be a constant such that

$$\begin{aligned} Q \rho (\mathcal {D}) \ge \pi _{\textrm{A}}(\mathcal {D}),\ \text {for any}\ \mathcal {D}\in \mathcal {Z}_{\textrm{A}} . \end{aligned}$$
(8)

ROhAN-r applies rejection sampling by first generating an ordered dataset \(\mathcal {D}' \in \mathcal {Z}_{\textrm{E}}\) using \(\mathcal {A}\), and then generating \(u \sim \mathcal {U}(0,1)\). If

$$\begin{aligned} u \le \frac{\pi _{\textrm{A}}(\textsf{un}(\mathcal {D}'))}{Q \rho (\textsf{un}(\mathcal {D}'))} \end{aligned}$$
(9)

then \(\textsf{un}(\mathcal {D}')\) is returned as a sample from \(\mathcal {Z}_{\textrm{A}}\) distributed according to \(\pi _{\textrm{A}}\). Otherwise, a new \(\mathcal {D}' \in \mathcal {Z}_{\textrm{E}}\) is generated using \(\mathcal {A}\), and the process continues. The correctness of ROhAN-r follows from the properties of rejection sampling and of the algorithm \(\mathcal {A}\).

The derivation of an expression for the constant Q, which depends on the ROA and ROE null models, but not on the algorithm \(\mathcal {A}\), is the only missing ingredient needed to apply ROhAN-r, thus it is left to the user or to the ROE/ROA algorithm designer.

There are even cases when the actual value of Q is not needed, as it partially cancels out in the ratio on the r.h.s. of (9). We now show how that is the case for the two null models for sequence datasets from Sect. 3.2 when \(\pi _{\textrm{E}}\) and \(\pi _{\textrm{A}}\) are the uniform distribution and the algorithms to sample from the ROE models are those by Jenkins et al. (2022).

Indeed, in these cases we have that \(\rho (\mathcal {D}) = {\textsf{c}(\mathcal {D})}/{\left| {\mathcal {Z}_{\textrm{E}}}\right| }\), where \(\textsf{c}(\mathcal {D})\) is from either Lemma 4 or Lemma 5, depending on the null model we are considering. It also holds \(\pi _{\textrm{A}}(\mathcal {D}) = {1}/{\left| {\mathcal {Z}_{\textrm{A}}}\right| }\) for every \(\mathcal {D}\in \mathcal {Z}_{\textrm{A}}\). We define \(Q \doteq {\left| {\mathcal {Z}_{\textrm{E}}}\right| }/{\left| {\mathcal {Z}_{\textrm{A}}}\right| }\), which satisfies the requirement from (8) because \(\textsf{c}(\mathcal {D}) \ge 1\). Then, we have that the condition from (9) can be rewritten as

$$\begin{aligned} u \le \frac{1}{\textsf{c}(\textsf{un}(\mathcal {D}'))}, \end{aligned}$$

which is readily computable from Lemma 4 or Lemma 5.
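A minimal Java sketch of ROhAN-r in this uniform case follows; roeSampler, un, and lnC are placeholders for the ROE subroutine \(\mathcal {A}\), the map \(\textsf{un}()\), and \(\ln \textsf{c}(\cdot )\) from Lemma 4 or 5. The acceptance test is done in log-space, since \(\textsf{c}(\textsf{un}(\mathcal {D}'))\) can be astronomically large.

```java
import java.util.Random;
import java.util.function.Function;
import java.util.function.Supplier;

final class RohanRExample {
    // Rejection sampling: accept un(D') with probability 1/c(un(D')),
    // i.e., accept iff ln u <= -ln c(un(D')).
    static <O, U> U sample(Supplier<O> roeSampler, Function<O, U> un,
                           Function<U, Double> lnC, Random rng) {
        while (true) {
            U candidate = un.apply(roeSampler.get());
            if (Math.log(rng.nextDouble()) <= -lnC.apply(candidate)) {
                return candidate;
            }
        }
    }
}
```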

4.4 Discussion

One may wonder whether “wrapping” existing algorithms for ROE models (whether MCMC or not) to obtain algorithms for ROA models, as ROhAN does, is the right approach, versus creating methods that directly sample from a set of unordered datasets. We already argued that one advantage, and not a small one, of the approach we followed is that one can reuse the large variety of available algorithms (e.g., for ROE SISLP models, i.e., for sampling binary matrices with fixed row and column sums, the literature is extensive (Fout 2022)), and even ones that will be developed in the future. Here we briefly discuss, through an example, a non-immediately-apparent drawback of “direct sampling” methods for ROA models.

Suppose that we want to develop an MCMC algorithm DirectROA for ROA SISLP models, using a Markov chain whose states are the unordered datasets in \(\mathcal {Z}_{\textrm{A}}\) (and not the ordered datasets in \(\mathcal {Z}_{\textrm{E}}\), as in ROhAN-m). We can define the neighborhood structure of the Markov chain by introducing a ROA variant of the swap operation used by the MCMC algorithm for sampling from ROE SISLP models (described in Sect. 3.1.1): there is an edge from \(\mathcal {D}'\) to \(\mathcal {D}''\) if the latter can be obtained from the former by swapping a pair of items between two transactions that are not one a subset (proper or improper) of the other, and each of which contains only one of the two items. The fact that such an operation can be easily defined and implemented, and that it should be easy to draw one such swap uniformly at random to choose a neighbor of the current state to propose as the next step, may lead us to believe that we are on the right track. Additionally, it would seem that a smaller, well-connected state space could lead to a faster mixing time of the chain. The issue is that, differently from what happens for ROE swaps, there may be multiple ROA swaps from \(\mathcal {D}'\) to \(\mathcal {D}''\), and the number of ROA swaps from a dataset to any different dataset (i.e., the ROA swaps that would lead from \(\mathcal {D}'\) to a \(\mathcal {D}'' \ne \mathcal {D}'\)) may also differ across unordered datasets, as it depends on quantities such as the number of identical transactions in \(\mathcal {D}'\). The algorithm would need to compute these two quantities at every step, for both the current state and the proposed next state, as their ratio is the neighbor sampling probability \(\eta _{\mathcal {D}'}(\mathcal {D}'')\), which is needed to obtain the acceptance probability as in (3). While computing these quantities is possible, it requires maintaining additional data structures and spending additional computational time at every step, for no clear advantage. We implemented such an algorithm DirectROA, and we compare ROhAN-m to it in Sect. 5, showing how ROhAN-m performs better in practice.

Extending our approach to null models that preserve constraints (including the row order) in expectation (De Bie 2010), whether using maximum entropy or not, seems challenging, as it requires deriving the probability from (7), which does not seem straightforward in many cases. This is a very interesting direction for future work.

One limitation of this work is that we do not show an upper bound on the mixing time of the Markov chain run by ROhAN-m, i.e., on the number of steps needed for the distribution of the current state to be (approximately) the stationary distribution (Mitzenmacher and Upfal 2005, Ch. 10). Using the MH approach makes such a derivation particularly challenging (e.g., no such bound is available for SwapRand either), and in any case the bound would depend on the nature of the Markov chain used by the ROE sampling algorithm that ROhAN-m uses as a subroutine. We measure the mixing time empirically in Sect. 5.6.

Table 1 Dataset statistics: number of transactions \(\left| {{\mathring{\mathcal {D}}}}\right|\), number of items \(\left| {\mathcal {I}}\right|\), density \({\text {avg} \left| {t}\right| }/{\left| {\mathcal {I}}\right| }\), where \(\text {avg}\left| {t}\right|\) is the average transaction length, sum \(w\doteq \sum _{i=1}^m \left| {t_i}\right|\) of the transaction lengths, support threshold \(\theta\) used in some experiments, and number of frequent itemsets w.r.t. \(\theta\)

5 Experimental evaluation


Our experimental evaluation focuses on three aspects. First, assessing the difference between ROE and ROA models, also showing how it can impact the validation of results from datasets. Second, measuring the speed and scalability of ROhAN-m through its step time, i.e., the time to take a step on the Markov chain, and how it changes as the number \(\left| {{\mathring{\mathcal {D}}}}\right|\) of transactions in the dataset grows. Third, empirically estimating the mixing time of ROhAN-m, i.e., the number of swaps needed for the distribution of the chain state to be close to the stationary distribution. We do not report on the empirical performance of ROhAN-r, as it would mostly be an assessment of the underlying algorithm used before the rejection sampling step.

5.1 Implementation, environment, datasets


All the algorithms and experiments are implemented in Java 8, and available from https://github.com/acdmammoths/ROhAN-code, together with instructions and a script to reproduce all our results and figures. We run our experiments on an x86-64 AWS EC2 instance with the Amazon Linux 2 OS, 128GB of RAM, and 32 vCPUs. We use the following five publicly availableFootnote 11 binary transactional datasets, whose relevant statistics are in Table 1:

  • Foodmart: customer transactions from a retail store.

  • Chess: a conversion of the UCI chess (King-Rook vs. King-Pawn) dataset, whose transactions represent chess board configurations.

  • Mushroom: a conversion of the UCI mushroom dataset, whose transactions describe different mushrooms using binary features.

  • BMS WebView 1 (BMS 1): click-stream data from a webstore used in KDD-Cup 2000, which has been prepared for itemset mining.

  • BMS WebView 2 (BMS 2): click-stream data from a webstore used in KDD-Cup 2000, which has been prepared for itemset mining.

Table 2 Difference between models: Minimum, 1st quartile, median, 3rd quartile, and maximum of \(\ln (\textsf{c}(\textsf{un}(\mathcal {D})))\) across 10,000 states \(\mathcal {D}\in \mathcal {Z}_{\textrm{E}}\)

5.2 Difference between ROE and ROA null models


Consider the ROE SISLP null model \((\mathcal {Z}_{\textrm{E}}, \pi _\textrm{E})\) for transactional datasets from Sect. 4.1, with \(\pi _\textrm{E}\) being the uniform distribution over \(\mathcal {Z}_{\textrm{E}}\), and the corresponding ROA SISLP model \((\mathcal {Z}_{\textrm{A}}, \pi _{\textrm{A}})\), with \(\pi _{\textrm{A}}\) being the uniform distribution over \(\mathcal {Z}_{\textrm{A}}\). In Lemma 1 we showed an example of an observed dataset \({\mathring{\mathcal {D}}}\) for which sampling a dataset \(\mathcal {D}\) from \(\mathcal {Z}_{\textrm{E}}\) uniformly at random does not imply that \(\textsf{un}(\mathcal {D})\) is a uniform sample from \(\mathcal {Z}_{\textrm{A}}\). That example was artificial, so we want to evaluate the situation on real datasets. Indeed, if there were a constant C such that \(\textsf{c}(\textsf{un}(\mathcal {D})) = C\) for every \(\mathcal {D}\in \mathcal {Z}_\textrm{E}\), then sampling \(\mathcal {D}\) from \((\mathcal {Z}_{\textrm{E}}, \pi _{\textrm{E}})\) and then taking the unordered dataset \(\textsf{un}(\mathcal {D})\) would be equivalent to sampling from \((\mathcal {Z}_{\textrm{A}}, \pi _{\textrm{A}})\), implying that the two null models are effectively the same, and perhaps suggesting that the definition of ROA models is not very interesting. The results of our experimental evaluation show instead that, even in this very simple case, ROE and ROA models are very different.
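To see concretely why a constant \(\textsf{c}\) would collapse the two models into one, note that drawing \(\mathcal {D}\) uniformly from \(\mathcal {Z}_{\textrm{E}}\) induces, on the unordered datasets, the distribution (this is the \(\rho (\mathcal {D}) = {\textsf{c}(\mathcal {D})}/{\left| {\mathcal {Z}_{\textrm{E}}}\right| }\) discussed earlier)

$$\begin{aligned} \Pr \left[ \textsf{un}(\mathcal {D}) = \mathcal {D}_{\textrm{A}}\right] = \sum _{\mathcal {D}\in \mathcal {Z}_{\textrm{E}} \,:\, \textsf{un}(\mathcal {D}) = \mathcal {D}_{\textrm{A}}} \frac{1}{\left| {\mathcal {Z}_{\textrm{E}}}\right| } = \frac{\textsf{c}(\mathcal {D}_{\textrm{A}})}{\left| {\mathcal {Z}_{\textrm{E}}}\right| } \quad \text {for every } \mathcal {D}_{\textrm{A}} \in \mathcal {Z}_{\textrm{A}}, \end{aligned}$$

which equals the uniform \(1/\left| {\mathcal {Z}_{\textrm{A}}}\right|\) if and only if \(\textsf{c}(\mathcal {D}_{\textrm{A}}) = {\left| {\mathcal {Z}_{\textrm{E}}}\right| }/{\left| {\mathcal {Z}_{\textrm{A}}}\right| }\) for every \(\mathcal {D}_{\textrm{A}}\), i.e., if and only if \(\textsf{c}\) is constant.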


Our experiment performs a (non-covering) random walk over \(\mathcal {Z}_{\textrm{E}}\), and computes the value \(\textsf{c}(\textsf{un}(\mathcal {D}))\) for each visited state \(\mathcal {D}\). While a random walk may visit a state more than once, this never happened in our experiments. The bias of the random walk towards higher-degree states has no impact on whether \(\textsf{c}(\textsf{un}(\mathcal {D}))\) is a constant. We report in Table 2 the distribution over 10,000 steps of \(\ln (\textsf{c}(\textsf{un}(\mathcal {D})))\) (we report the logarithms because the raw quantities are truly “astronomical”). Clearly, \(\textsf{c}(\textsf{un}(\mathcal {D}))\) is anything but constant: there are datasets in \(\mathcal {Z}_{\textrm{A}}\) which have \(\approx e^{5000} \approx 10^{2171}\) times more equivalent ordered datasets in \(\mathcal {Z}_{\textrm{E}}\) than other datasets in \(\mathcal {Z}_{\textrm{A}}\), as can be seen by taking the difference between the maximum and minimum entries for BMS 1 or BMS 2, and noting that this difference is the natural logarithm of the ratio between the maximum and minimum raw values. Even in the smallest case (Chess), the ratio between the maximum and minimum raw values is more than \(e^{10}\). Thus ROE and ROA null models are quite different, and ROA models are therefore a genuinely new addition to the library of available null models for statistically-sound KDD.
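A sketch of this measurement loop follows, reusing logC from the earlier sketch; applyRandomROESwap, standing for one uniformly-chosen ROE swap, is a hypothetical helper.

```java
import java.util.*;

final class Table2Experiment {
    // Random walk over Z_E: apply one ROE swap per step and record ln c(un(D))
    // for the visited state; sorting gives the quantiles reported in Table 2.
    static double[] walkAndLog(List<List<Integer>> start, int steps, Random rng) {
        double[] lnC = new double[steps];
        List<List<Integer>> state = start;
        for (int i = 0; i < steps; i++) {
            state = applyRandomROESwap(state, rng); // hypothetical: one uniform ROE swap
            lnC[i] = RohanRSketch.logC(state);      // log-space: raw c overflows doubles
        }
        Arrays.sort(lnC);
        return lnC; // lnC[0] = minimum, lnC[steps / 4] = 1st quartile, and so on
    }

    static List<List<Integer>> applyRandomROESwap(List<List<Integer>> d, Random rng) {
        throw new UnsupportedOperationException("any ROE swap implementation works");
    }
}
```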

5.3 Impact of null model choice on statistical validation of results


Using a ROA vs. a ROE model may lead to different outcomes in the validation of results obtained from a transactional dataset. We used ROhAN-m (with SwapRand as a subroutine) and SwapRand to compute the significant frequent itemsets (Hämäläinen and Webb 2019; Pellegrina et al. 2019) under a ROA and a ROE model, respectively. The two returned sets of significant patterns in Chess, with FWER \(\delta =0.05\), were extremely different, with a Jaccard index of 0.12. This fact should not be surprising: given the difference highlighted in the previous experiment, one should expect the (empirical) distributions of the test statistics under the two null models to be very different, and therefore so are the empirical p-values used for the tests. Once more, this result is evidence that the user must be extremely cautious in choosing the assumed null model: the meaning of significance depends on the null model, and it is not meaningful to compare results obtained under different null models (e.g., to compare the statistical power of two procedures).
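For concreteness, the empirical p-value of a single pattern is computed in the standard resampling fashion (Westfall and Young 1993), with the usual add-one correction so that no p-value is zero; the only ingredient that changes between the two null models is which sampled datasets the statistics are computed on.

```java
final class EmpiricalPValue {
    // Fraction of resampled datasets whose test statistic is at least as extreme
    // as the observed one, with add-one correction in numerator and denominator.
    static double compute(double observedStat, double[] resampledStats) {
        int atLeastAsExtreme = 0;
        for (double s : resampledStats) {
            if (s >= observedStat) atLeastAsExtreme++;
        }
        return (1.0 + atLeastAsExtreme) / (1.0 + resampledStats.length);
    }
}
```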

Table 3 Step time (in ms): minimum, 1st quartile, median, 3rd quartile, and maximum over 10,000 steps

Fig. 1 Scalability results: the step time distribution (in milliseconds) over 10,000 swaps for increasing values of \(\left| {{\mathring{\mathcal {D}}}}\right|\). The line in each box corresponds to the median, the bottom and top of each box to the first and third quartiles, and the lower and upper whiskers to the minimum and maximum

Fig. 2 Convergence results: \(\textsf{ARSD}(\mathcal {D}_s)\) as the swap number multiplier \(k\) grows, where \(k\) is such that the number of swaps is \(s\doteq \lfloor k\sum _{i=1}^m \left| {t_i}\right| \rfloor\)

5.4 Step times

The step time is the time needed to obtain a valid swap, compute the MH acceptance probability, and transition to the next state if the swap is accepted. In Table 3 we report the distribution, over 10,000 steps, of this quantity for three algorithms: ROhAN-m, SwapRand, and the “direct” sampling algorithm DirectROA described in Sect. 4.4. We show the results for SwapRand only for comparison purposes: SwapRand is not to be preferred just because it appears faster, as it samples from a ROE model, while the other two algorithms sample from a ROA model.

The distribution for ROhAN-m is comparable to that of SwapRand, while DirectROA is slightly slower. The first fact is expected, since the executions of SwapRand and ROhAN-m are very similar: the only additional work for ROhAN-m is computing the ratio of \(\textsf{c}(\textsf{un}(\mathcal {D}))\) to \(\textsf{c}(\textsf{un}(\mathcal {D}'))\) using Alg. 1. That DirectROA is slower may seem surprising, because one might think that sampling “directly” from the desired space of unordered datasets would be more efficient. On the contrary, as discussed in Sect. 4.4, “moving” over this space, as the Markov chain of DirectROA does, requires additional computation, which becomes relatively expensive when many transactions have the same length, as in Chess and Mushroom. We find this to be a counterintuitive algorithmic observation, which reinforces the appropriateness of the approach taken by ROhAN-m, i.e., reusing existing algorithms for ROE models.

5.5 Scalability

We use the IBM Quest generator to create synthetic datasets with \(\left| {{\mathring{\mathcal {D}}}}\right| \in \{5000, 10000, 15000, 20000\}\) transactions, over \(\left| {\mathcal {I}}\right| =100\) items, and with average transaction length \(\text {avg}\left| {t}\right| =25\).Footnote 12 We run all algorithms for 10,000 swaps on each dataset, and report the results in Fig. 1. The step time grows linearly with the number of transactions, as all algorithms need to compute the number of neighbors of the proposed next state, which takes time linear in \(\left| {{\mathring{\mathcal {D}}}}\right|\). The interquartile range (\(Q3 - Q1\)) grows in absolute terms because the individual step times grow, but it is essentially constant in relative terms.

5.6 Convergence to the stationary distribution

Since we cannot prove an upper bound on the mixing time of the Markov chain used by ROhAN-m (see Sect. 4.4), we estimate it empirically. Following other works (Tonon and Vandin 2019), we track the Average Relative Support Difference (ARSD), defined as follows, as a proxy for the mixing time: the assumption is that when this quantity stabilizes, the chain has mixed. Given the observed dataset \({\mathring{\mathcal {D}}}\), let \(\theta \in [1, \left| {{\mathring{\mathcal {D}}}}\right| ]\) be a minimum support threshold, and \(\mathcal {D}_s\) be the dataset corresponding to the state of the chain after \(s\in \mathbb {N}\) swaps. Then,

$$\begin{aligned} \textsf{ARSD}(\mathcal {D}_s) \doteq \frac{1}{\left| {\textsf{FP}_{{\mathring{\mathcal {D}}}}(\theta )}\right| } \sum _{A \in \textsf{FP}_{{\mathring{\mathcal {D}}}}(\theta )} \frac{\left| {\sigma _{{\mathring{\mathcal {D}}}}(A) - \sigma _{\mathcal {D}_s}(A)}\right| }{\sigma _{{\mathring{\mathcal {D}}}}(A)} . \end{aligned}$$

Fig. 2 shows \(\textsf{ARSD}(\mathcal {D}_s)\) for \(s\doteq \lfloor kw\rfloor\) swaps, where \(k\in \{0, 0.25, 0.50, \dotsc , 2, 3, 4, 5\}\) and \(w\doteq \sum _{i = 1}^{m} \left| {t_i}\right|\), for \(t_i \in {\mathring{\mathcal {D}}}\). We use the values of \(\theta\) from Table 1: the qualitative results do not change with other values.
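The ARSD itself is straightforward to compute once the supports are available; the following is a direct transcription of the formula above, where the first map holds the supports \(\sigma _{{\mathring{\mathcal {D}}}}(A)\) of the itemsets in \(\textsf{FP}_{{\mathring{\mathcal {D}}}}(\theta )\), the second the supports \(\sigma _{\mathcal {D}_s}(A)\) in the current chain state, and the names are illustrative.

```java
import java.util.Map;
import java.util.Set;

final class Arsd {
    // Average, over the itemsets frequent in the observed dataset, of the
    // relative difference between the observed and the current supports.
    static double compute(Map<Set<Integer>, Double> observedSupports,
                          Map<Set<Integer>, Double> currentSupports) {
        double sum = 0.0;
        for (Map.Entry<Set<Integer>, Double> e : observedSupports.entrySet()) {
            double obs = e.getValue();                                  // sigma in the observed dataset
            double cur = currentSupports.getOrDefault(e.getKey(), 0.0); // sigma in D_s
            sum += Math.abs(obs - cur) / obs;
        }
        return sum / observedSupports.size(); // divide by |FP(theta)|
    }
}
```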

We remark that comparing the mixing times of Markov chains with different stationary distributions (such as those of SwapRand and ROhAN-m) is meaningless, as the chains sample different objects from different sets according to different distributions. Nor are the values of the ARSD comparable: the stabilization of the ARSD is a proxy for the mixing time, but its value is not a proxy for the distance between the state distribution and the stationary distribution. Therefore, we do not make such comparisons, and only include the results for SwapRand for completeness (the mixing time for SwapRand is the same observed by Gionis et al. (2007), Sect. 5.1). On BMS 1, the ARSD converges to a different value for SwapRand, which we take as another indication that ROE and ROA models are different.

Figure 2 shows that in all cases the ARSD stabilizes by \(s= 2 w\) swaps, and often earlier (by \(s= w\)), i.e., the mixing time appears to be linear in \(w\). For Chess, the fluctuations in the ARSD may seem large, but this is due to the scale of the y-axis, which is much smaller in Fig. 2b than in the other subfigures. The fact that DirectROA requires approximately the same number of steps as ROhAN-m to converge, combined with the fact that each of its steps takes longer (Table 3 and Fig. 1), supports the design decisions behind ROhAN-m, as we argued in Sect. 4.4.

6 Conclusion

We introduce a novel type of null models for transactional and sequence datasets, which are Row-Order Agnostic (ROA), i.e., do not consider the order of the rows in the observed dataset as fixed. These null models expand the collection of null models available to users for testing the significance of results obtained from the datasets, i.e., for performing statistically-sound KDD. We present ROhAN, an algorithmic framework for drawing samples from ROA models according to a user-specified distribution, a necessary step for assessing significance with resampling-based statistical hypothesis tests. ROhAN employs algorithms for sampling from Row-Order Enforcing (ROE) null models as subroutines: it uses the Metropolis-Hastings approach to adapt Markov chain Monte Carlo algorithms, and rejection sampling for the others. ROhAN is “future-proof” in the sense that even algorithms developed in the future for future ROE models can easily be adapted for use within ROhAN.

Our experimental evaluation shows that ROA and ROE models are quite different, and that this difference impacts the outcomes of the statistical validation of results. We also show that ROhAN is fast and scales well.

Interesting directions for future work include the definition of ROA null models for other kinds of data (e.g., real-valued datasets) and of maximum-entropy ROA models, and the design of efficient algorithms to sample from these null models.