5.1 Introduction

In biological sequence analysis, a motif is a nucleotide or amino-acid sequence pattern which appears in a set of sequences (DNA, RNA or protein) with much higher frequency than would be expected by chance. This statistical overrepresentation is expected to be indicative of an associated biological function. Examples of motifs include DNA- and RNA-binding sites for regulatory proteins, protein domains and protein epitopes.

DNA and RNA motifs are key to deciphering the language of gene regulatory mechanisms and, in particular, to fully understanding how gene expression is regulated in time and space. For this reason, de novo (or ab initio) motif discovery, i.e. identifying motif sites (signals) in a given set of unaligned biological sequences, has attracted the attention of many biologists. However, motifs are also difficult to identify, since they often produce weak signals buried in genomic noise (i.e. the background sequence) [8]. The problem is known to be NP-hard [22], so it is also an interesting arena for computer scientists.

Most of the motif discovery tools reported in the literature are designed to discover single motifs. However, in many (if not most) cases, biological activities are co-regulated by several factors. For instance, transcription factor-binding sites (TFBSs) on DNA are often organized in functional groups called composite motifs, cis-regulatory modules (CRMs) or simply modules. These modules may have a biologically important structure that constrains both the number and relative position of the constituent motifs [34].

One example, among many that could be cited, is ETS-CBF, a cis-regulatory module constituted by three single motifs, μA, μB and CBF (core-binding factor). Both μA and μB are binding sites for two transcription factors belonging to the ETS protein family, Ets-1 and PU.1, respectively. CBF is a protein implicated in the activation of several T- and myeloid-cell-specific promoters and enhancers. Enhancers are cis-regulatory sequences which control the efficiency of gene transcription from an adjacent promoter. ETS-CBF is a common composite motif of enhancers implicated in the regulation of antigen receptor genes in mouse and human. A comparative study of the tripartite domain of the murine immunoglobulin μ heavy-chain (IgH) enhancer and its human homologue has demonstrated that in both species the activity of the gene enhancer is strictly dependent on ETS-CBF [12].

Therefore, it is of great interest to discover not only single motifs but also the higher order structure into which motifs are organized, i.e. the modules. This problem is also known as composite [38] or structured [32] motif discovery.

Over the past few years, a plethora of single motif discovery tools have been reported in the literature (see the book by Robin et al. [36]). They differ in three aspects:

  1. The representation of a pattern that constitutes a single motif,

  2. The definition of overrepresentation of a motif pattern and

  3. The search strategy applied to pattern finding.

A single motif can be represented either by a consensus sequence, which contains the most frequent nucleotide in each position of the observed signals, or by a position weight matrix (PWM), which assigns a different probability to each possible letter at each position in the motif [46].

Both consensus sequences and PWMs are derived from multiple alignments of all the known recognition sites for a given regulatory factor and represent the specificity of a regulatory factor for its own recognition site. They refer to a sequence that matches all the sequences of the aligned recognition sites very closely, but not necessarily exactly. In a consensus sequence, this concept is expressed by notations that indicate which positions of the consensus are always occupied by the same nucleotide (exact match) and which ones can vary, and how (allowed mismatch), without affecting the functionality of the motif. Consider the example DNA consensus sequence T[CT]NG{A}A, which is read as follows: the first, fourth and sixth positions are always occupied by T, G and A, respectively, where T stands for thymine, G for guanine and A for adenine; no mismatches are allowed in these positions. The second nucleotide can be a cytosine (C) or, alternatively, a T; this mismatch does not affect the effectiveness of the recognition signal. The third position can be occupied by any of the four nucleotide bases (A, T, C, G). At the fifth position, any base except A can be present.
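As an illustration, such a consensus can be checked mechanically by translating it by hand into a regular expression (a sketch; N is expanded to any base, and the {A} exclusion to the three remaining bases):

```python
import re

# Sketch: the consensus T[CT]NG{A}A translated by hand into a regular
# expression. N (any base) becomes [ACGT]; {A} (any base except A) becomes [CGT].
consensus = re.compile(r"T[CT][ACGT]G[CGT]A")

for site in ["TCAGTA", "TTGGCA", "TCAGAA"]:
    # fullmatch returns None when some position violates the consensus;
    # TCAGAA fails because A is forbidden at the fifth position.
    print(site, bool(consensus.fullmatch(site)))
```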

A PWM of a DNA motif has one row for each nucleotide base (A, T, C, G) and one column for each position in the pattern. This way, there is a matrix element for every possible base at every position. The score of any particular motif occurrence in a sequence of DNA is the sum of the matrix values for that occurrence's bases. This score equals the consensus score only when the occurrence perfectly matches the consensus. Any sequence that differs from the consensus in some positions will have a lower score, depending on the number and type of nucleotide mismatches.
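A minimal sketch of this scoring scheme follows; the matrix values are invented for illustration (real PWMs are typically log-odds weights estimated from aligned sites):

```python
# Hypothetical weight matrix for a 3-bp motif: one row per base (A, T, C, G),
# one column per position. The values are illustrative, not from real data.
PWM = {
    "A": [1, 2, 1],
    "C": [2, 1, 1],
    "G": [1, 1, 7],
    "T": [6, 6, 1],
}

def pwm_score(site: str) -> int:
    # Sum one matrix entry per position, selected by the base observed there.
    return sum(PWM[base][i] for i, base in enumerate(site))

# "TTG" collects the column maxima, i.e. it is the consensus and gets the top
# score; "TAG" mismatches the consensus at the second position and scores lower.
print(pwm_score("TTG"))  # 19
print(pwm_score("TAG"))  # 15
```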

In contrast to these sequence patterns, spatial patterns have also been investigated [19], where spatial relationships (e.g. adjacency and parallelism) and shapes (e.g., α-helices in protein motifs) can be represented.

The overrepresentation of motif patterns has been defined in several ways. In some motif-discovery algorithms, a score is defined for each pattern (e.g., p-value [47] or z-score [43]), and the observed motif scores are compared with expected scores from a background model. In other algorithms, two separate values are computed when evaluating motifs, one concerning the support, or coverage, of a motif, and the other concerning the unexpectedness of a motif [35]. A third approach is to use a measure of information content [25] of discovered patterns.

Search strategies can be categorized as enumerative (or pattern-driven) and heuristic (or sequence-driven). The former enumerate all possible motifs in a given solution space (defined by a template pattern) and test each for significance, while the latter try to build a motif model by varying some model parameters such that a matching score with sequence data is maximized. In general, enumerative algorithms find optimal solutions for discrete representations of relatively short motifs, but do not scale well to larger motifs and continuous models. TEIRESIAS [35] is more sophisticated in using information about the relative occurrences of substrings; therefore, it can be used to discover discrete representations of longer motifs. Among the heuristic-based approaches, the most common is the expectation-maximization (EM) [5], which is a deterministic local search algorithm. EM may converge very fast, but the optimality of the returned point strongly depends on the starting point (seed). For this reason, it is used in combination with some randomization techniques in order to escape from a poor local optimum even if the chosen seed is bad [6].

Algorithms for the de novo discovery of modules, together with the parameters of their constituent motifs [14, 41, 52], are more recent. These algorithms, which exploit some form of spatial information (e.g., spatial correlation) on constituent motifs to identify a module, are considered particularly promising since they may offer both improved performance over conventional discovery algorithms of single motifs and insight into the mechanism of regulation directed by the constituent motifs [26]. However, in order to restrict the search space, they make some assumptions which limit their flexibility in handling variations of either the number or length of the constituent motifs or the spacing between them. For instance, the hierarchical mixture (HMx) model of CISMODULE [52] requires the specification of both the length of the module and the total number of constituent motif types. Moreover, CISMODULE does not capture the order or precise spacing of multiple TFBSs in a module. Segal and Sharan [41] propose a method for the de novo discovery of modules consisting of a combination of single motifs which are spatially close to each other. Despite the flexibility of their method in handling modules, they assume that a training set (with positive and negative examples of transcriptional regulation) is available in order to learn a discriminative model of modules. The method EMC module proposed by Gupta and Liu [14] assumes a geometrical probability distribution on the distance between TFBSs.

Although a recent study [18] has shown a significant improvement in prediction success when modules are considered instead of isolated motifs, it is largely believed that, without some strong form of inductive bias, methods for de novo module discovery may have performance close to random. For this reason, another line of module discovery methods has been investigated (e.g., Cister [13], ModuleSearcher [1], MScan [20], Compo [39]), which takes a list of single motifs as input along with the sequence data in which the modules should be found. Single motifs are taken from motif databases, such as TRANSFAC [15] and JASPAR [37], and the challenges concern discovering which of them are involved in the module, defining the sequence of single motifs in the module and possibly discovering the inter-motif distances.

Module discovery methods can be categorized according to the type of framework, either discrete (e.g., CREME [42]) or probabilistic (e.g., Logos [51]), adopted to model modules. In a discrete framework, all constituent motifs must appear in a module instance. This simplifies inference and interpretation of modules, and often allows an exhaustive search for optimal constituent motifs in a sequence window of a given length. Conversely, a probabilistic framework is more expressive, since it relaxes the hard constraints of discrete frameworks and associates each module with a score which is a combination (e.g., the sum) of motif and distance scores. The drawbacks of probabilistic frameworks are convergence to local optima and the reduced interpretability of results.

A recent assessment of eight published methods for module discovery [21] has shown that no single method performed consistently better than others in all situations and that there are still advances to be made in computational module discovery. In this chapter, we propose an innovative approach to module discovery, which can be a useful supplement or alternative to other well-known approaches. The idea is to mine rules which define “strong” spatial associations between single motifs [27]. Single motifs might either be de novo discovered by traditional discovery algorithms or taken from databases of known motifs.

The spatial relationships considered in this work are the order of motifs along the DNA sequence and the inter-motif distance between each consecutive pair of motifs, although the mining method proposed to generate spatial association rules has no limitation on either the number or the nature of the spatial relationships. The association rule mining method is based on an inductive logic programming (ILP) [31] formulation according to which both data and discovered patterns are represented in a first-order logic formalism. This formulation also facilitates the accommodation of diverse sources of domain (or background) knowledge, which are expressed in a declarative way. Indeed, ILP is particularly well suited to bioinformatics tasks due to its ability both to take into account background knowledge and to work directly with structured data [30]. This is confirmed by some notable successes in molecular biology applications, such as predicting carcinogenesis [44, 45].

The proposed approach is based on a discrete framework, which presents several advantages, the most relevant being the straightforward interpretation of rules, but also some disadvantages, such as the hard discretization of numerical inter-motif distances or the choice of a minimum support threshold. To overcome these issues, some computational solutions have been developed and tested.

The specific features of this approach are:

  • An original perspective of module discovery as a spatial association rule mining task;

  • A logic-based approach where background knowledge can be expressed in a declarative way;

  • A procedure for the automated selection of some parameters which are difficult to properly set;

  • Some computational solutions to overcome the discretization issues of discrete approaches.

These features provide our module discovery tool with several advantages over competing approaches. First, spatial association rules, which take the form A ⇒ C, provide insight both into the support of the module (represented by A ∧ C) and into the confidence of possible predictions of C given A. Predictions may equally concern properties of motifs (e.g., their type) and spatial relationships (e.g., the inter-motif distance). Second, the declarative knowledge representation facilitates the development and debugging of background knowledge in collaboration with a domain expert. Moreover, knowledge expressed in a declarative way is re-usable across different tasks and domains, thus easing the burden of the knowledge engineering effort. Third, the resort to first-order (or relational) logic facilitates the representation of input sequences, whose structure can be arbitrarily complex, and increases the explanatory power of discovered patterns, which are relatively easy for domain experts to interpret. Fourth, the computational solutions devised for both the problem of selecting a minimum support threshold and the problem of discretizing numerical data fulfill the twofold goal of improving the quality of results and designing tools for the actual end-users, namely biologists.

Further significant advantages are:

  • No prior assumption is necessary either on the constituent motifs of a module or on their spatial distribution;

  • Specific information on the bases occurring between two consecutive motifs is not required.

This work also extends our previous study [48], where frequent patterns are generated by means of the algorithm GSP [3]. The extension aims to: (1) find association rules, which convey additional information with respect to frequent patterns; (2) discover more significant inter-motif distances by means of a new discretization algorithm which does not require input parameters; (3) automatically select the best minimum support threshold; (4) filter redundant rules; (5) investigate a new application of an ILP algorithm to a challenging bioinformatics task.

The chapter is organized as follows. Section 5.2 presents a formalization of the problem, which is decomposed into two subproblems: (1) mining frequent sets of motifs, and (2) mining spatial association rules. The input and output of each step of the proposed approach are also reported. Section 5.3 describes the method for spatial association rule mining. Section 5.4 presents the solution to some methodological and architectural problems which affect the implementation of a module discovery tool effectively usable by biologists. Section 5.5 is devoted to a case study, which shows the application of the developed system. Finally, conclusions are drawn.

5.2 Mining Spatial Association Rules from Sequences

Before proceeding to a formalization of the problem, we first introduce some general notions on association rules.

Association rules are a class of patterns that describe regularities or co-occurrence relationships in a set T of homogeneous data structures (e.g., sets, sequences and so on) [2]. Formally, an association rule R is expressed in the form A ⇒ C, where A (the antecedent) and C (the consequent) are disjoint conditions on properties of data structures (e.g., the presence of an item in a set). The meaning of an association rule is quite intuitive: if a data structure satisfies A, then it is likely to satisfy C. To quantify this likelihood, two statistical parameters are usually computed, namely support and confidence. The former, denoted as sup(R, T), estimates the probability P(A ∧ C) by means of the percentage of data structures in T satisfying both A and C. The latter, denoted as conf(R, T), estimates the probability P(C | A) by means of the percentage of data structures which satisfy condition C, out of those which satisfy condition A. The task of association rule mining consists in discovering all rules whose support and confidence values exceed two respective minimum thresholds. When data structures describe spatial objects together with their spatial relationships, mined association rules are called spatial, since conditions in either the antecedent or the consequent of a rule express some form of spatial constraint.
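These two measures can be computed directly from their definitions. The following sketch uses a toy set T of transactions (sets of items) and an illustrative rule A ⇒ C; all names and data are invented:

```python
# Toy data: four data structures (sets of items) and a rule A => C.
T = [{"x", "y"}, {"x", "y", "z"}, {"x"}, {"y", "z"}]
A, C = {"x"}, {"y"}

n_A = sum(1 for t in T if A <= t)         # structures satisfying A
n_AC = sum(1 for t in T if (A | C) <= t)  # structures satisfying both A and C

support = n_AC / len(T)   # estimates P(A and C)
confidence = n_AC / n_A   # estimates P(C | A)

print(support)     # 0.5: 2 of the 4 structures contain both x and y
print(confidence)  # 2 of the 3 structures containing x also contain y
```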

We now give a formal statement of the module discovery problem, which is decomposed into two subproblems as follows:

  1. Given: A set M of single motifs, a set T of sequences with annotations about the type and position of motifs in M, and a minimum value τ_min. Find: The collection \(\mathcal{S}\) of all the sets S_1, S_2, …, S_n of single motifs such that, for each S_i, at least τ_min sequences in T contain all motifs in S_i.

  2. Given: A set \(S \in \mathcal{S}\) and two thresholds σ_min and κ_min. Find: Spatial association rules involving motifs in S, such that their support and confidence are greater than σ_min and κ_min, respectively.

Single motifs in M can either be discovered de novo or taken from a single motif database. Each \({S}_{i} \in \mathcal{S}\) is called a motif set. The support set of S_i is the subset \({T}_{{S}_{i}}\) of sequences in T such that each sequence in \({T}_{{S}_{i}}\) contains at least one occurrence of each motif in S_i. According to the statement of subproblem (1), \(\vert {T}_{{S}_{i}}\vert \geq {\tau }_{\mathit{min}}\). \({T}_{{S}_{i}}\) is used to evaluate both the support and the confidence of the spatial association rules mentioned in subproblem (2).

The proposed approach consists of two steps, reflecting this problem decomposition. In the first step, motif sets which are frequent, i.e., have a support greater than τ_min, are extracted from sequences annotated with predictions for known single motifs. Only information about the occurrence of motifs is considered, while the spatial distribution of motifs is ignored. This step has a manifold purpose: (1) enabling biologists to guide deeper analysis only for sets of motifs which are deemed potentially interesting; (2) filtering out sequences which do not include those interesting sets of motifs; (3) lowering the computational cost of the second step.

In the second step, sequences that support specific frequent motif sets are abstracted into sequences of spaced motifs. A sequence of spaced motifs is defined as an ordered collection of motifs interleaved with inter-motif distances. Each inter-motif distance measures the distance between the last nucleotide of a motif and the first nucleotide of the next motif in the sequence. Spatial association rules are mined from these abstractions. In order to deal with numerical information on the inter-motif distance, a discretization algorithm is applied. The algorithm takes into account the distribution of the examples and does not significantly depend on input parameters as in the case of classical equal width or equal frequency discretization algorithms. Details on both steps are reported below.

5.2.1 Mining Frequent Motif Sets

To solve the first subproblem, we resort to a levelwise breadth-first search [28] in the lattice of motif sets. The search starts from the smallest elements, i.e., sets with one motif in M, and proceeds from smaller to larger sets. The frequent sets of i motifs are used to generate candidate sets of (i + 1) motifs. Each candidate set, which is potentially frequent, is evaluated against the set T of sequences, in order to check the actual size of its support set. It is then pruned if its support is lower than τ_min. For instance, given M = {x, y, z} and T as in Fig. 5.1a, the set S = {x, y} is supported by T_S = {t_2, t_3}. If τ_min = 2, then S is returned together with the other frequent motif sets in \(\mathcal{S}\).
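The levelwise search on this example can be sketched as follows. This is a simplified Apriori-style implementation; the sequence annotations are invented so that, as in the text, S = {x, y} is supported by exactly {t_2, t_3}, and the candidate-generation details of the actual system may differ:

```python
from itertools import combinations

M = {"x", "y", "z"}
# Illustrative motif annotations for three sequences t1, t2, t3.
T = [{"x", "z"}, {"x", "y"}, {"x", "y", "z"}]
tau_min = 2

def support(S):
    # Size of the support set: sequences containing all motifs in S.
    return sum(1 for t in T if S <= t)

frequent = []
level = [frozenset({m}) for m in M]          # start from one-motif sets
while level:
    level = [S for S in level if support(S) >= tau_min]   # prune infrequent sets
    frequent.extend(level)
    # Generate candidate (i+1)-sets by joining frequent i-sets.
    level = list({a | b for a, b in combinations(level, 2)
                  if len(a | b) == len(a) + 1})

print(sorted(sorted(S) for S in frequent))
# [['x'], ['x', 'y'], ['x', 'z'], ['y'], ['z']]
```

Note that {y, z} and {x, y, z} are generated as candidates but pruned, since only t_3 supports them.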

Fig. 5.1
figure 1

(a) Three different annotated sequences (t_1, t_2, t_3) belonging to the set T where motifs x and y have been found. The grey semi-boxes underline the nucleotide sequences between two consecutive motifs (inter-motif distance). Inter-motif distances are expressed in base pairs (bp). (b) Closed intervals of inter-motif distances

5.2.2 Mining Spatial Association Rules

The sequences in the support set T_S of a frequent motif set S are represented as chains of the form ⟨m_1, d_1, m_2, …, d_{n−1}, m_n⟩, where each m_i denotes a single motif (m_i ∈ M), while each d_i, \(i = 1,2,\ldots,n - 1\), denotes the inter-motif distance between m_i and m_{i+1}. Each chain is a sequence of spaced motifs. For instance, sequence t_2 in Fig. 5.1a is represented as ⟨x, 10, y, 92, y⟩.

From a biological viewpoint, slight differences in inter-motif distances can be ignored. For this reason, we can group almost equal distances by applying a discretization technique which maps numerical values into a set of closed intervals. Therefore, a sequence of spaced motifs can be further abstracted into an ordered collection of motifs interleaved with symbols (e.g., short, medium and large) representing ranges of inter-motif distance. For instance, by considering the closed intervals in Fig. 5.1b, both sequences t_2 and t_3 in Fig. 5.1a are represented by the following sequence of spaced motifs:

$$\langle x,\mathit{short},y,\mathit{medium},y\rangle.$$
(5.1)
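The abstraction step can be sketched as follows. The interval boundaries below are invented stand-ins for those of Fig. 5.1b; the actual system derives them from the data distribution, as described later:

```python
# Illustrative closed intervals of inter-motif distances (in bp); the real
# boundaries come from the discretization algorithm, not from these guesses.
intervals = [(0, 25, "short"), (26, 100, "medium"), (101, 500, "large")]

def label(distance):
    # Map a numeric distance to the name of the closed interval containing it.
    return next(name for lo, hi, name in intervals if lo <= distance <= hi)

# Chain <m1, d1, m2, d2, m3> for sequence t2: <x, 10, y, 92, y>
t2 = ["x", 10, "y", 92, "y"]
abstracted = [e if isinstance(e, str) else label(e) for e in t2]
print(abstracted)  # ['x', 'short', 'y', 'medium', 'y'], matching (5.1)
```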

Each sequence of spaced motifs is described in a logic formalism which can be processed by the ILP system SPADA (Spatial Pattern Discovery Algorithm) [24] to generate spatial association rules. More precisely, the whole sequence, the constituent motifs and the inter-motif distances are represented by distinct constant symbols. Some predicate symbols are introduced in order to express both properties and relationships. They are:

  • sequence(t): t is a sequence of spaced motifs;

  • part_of(t,m): The sequence t contains an occurrence m of a single motif;

  • is_a(m,x): The occurrence m is of motif type x;

  • distance(m_1,m_2,d): The distance between the occurrences m_1 and m_2 is d.

A sequence is represented by a set of Datalog ground atoms, where a Datalog ground atom is an n-ary predicate symbol applied to n constants. For instance, the sequence of spaced motifs in (5.1) is described by the following set of Datalog ground atoms:

$$\left \{\begin{array}{l} \mathit{sequence}({t}_{2}), \\ \mathit{part}\_\mathit{of }({t}_{2},{m}_{1}),\ \mathit{part}\_\mathit{of }({t}_{2},{m}_{2}),\ \mathit{part}\_\mathit{of }({t}_{2},{m}_{3}), \\ \mathit{is}\_a({m}_{1},x),\ \mathit{is}\_a({m}_{2},y),\ \mathit{is}\_a({m}_{3},y), \\ \mathit{distance}({m}_{1},{m}_{2},\mathit{short}),\ \mathit{distance}({m}_{2},{m}_{3},\mathit{medium}).\end{array} \right \}$$
(5.2)

The set of Datalog ground atoms of all sequences is stored in the extensional part D_E of a deductive database D. The intensional part D_I of the deductive database D includes the definition of the domain knowledge in the form of Datalog rules. An example of Datalog rules is the following:

$$\begin{array}{l} \mathit{short}\_\mathit{medium}\_\mathit{distance}(U,V ) \leftarrow \mathit{distance}(U,V,\mathit{short}). \\ \mathit{short}\_\mathit{medium}\_\mathit{distance}(U,V ) \leftarrow \mathit{distance}(U,V,\mathit{medium}).\end{array}$$
(5.3)

They state that two motifs are at a short_medium_distance if they are at either a short or a medium distance (Fig. 5.1b). Rules in D_I allow additional Datalog ground atoms to be deduced from the data stored in D_E. For instance, the rules in (5.3) entail the following information from the set of Datalog ground atoms in (5.2):

$$\left \{\begin{array}{l} \mathit{short}\_\mathit{medium}\_\mathit{distance}({m}_{1},{m}_{2}), \\ \mathit{short}\_\mathit{medium}\_\mathit{distance}({m}_{2},{m}_{3}). \end{array} \right \}$$
(5.4)

SPADA adds these entailed Datalog ground atoms to set (5.2), so that atoms with the predicate short_medium_distance can also appear in mined association rules.
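The deduction step from (5.3) to (5.4) can be mimicked with a one-pass forward-chaining sketch, representing ground atoms as tuples; a real deductive database engine is of course more general:

```python
# Extensional part D_E: the distance atoms from (5.2), as predicate tuples.
D_E = {
    ("distance", "m1", "m2", "short"),
    ("distance", "m2", "m3", "medium"),
}

# Intensional rules (5.3): short_medium_distance(U, V) holds whenever
# distance(U, V, short) or distance(U, V, medium) holds in D_E.
derived = {
    ("short_medium_distance", u, v)
    for pred, u, v, d in D_E
    if pred == "distance" and d in ("short", "medium")
}

print(sorted(derived))
# [('short_medium_distance', 'm1', 'm2'), ('short_medium_distance', 'm2', 'm3')]
```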

Spatial association rules discovered by SPADA take the form A ⇒ C, where both A and C are conjunctions of Datalog non-ground atoms. A Datalog non-ground atom is an n-ary predicate symbol applied to n terms (either constants or variables), at least one of which is a variable. For each association rule, there is exactly one variable denoting the whole sequence and other variables denoting constituent motifs. An example of a spatial association rule is the following:

$$\begin{array}{rcl} \mathit{sequence}(T),& \mathit{\ part}\_\mathit{of }(T,{M}_{1}),\mathit{is}\_a({M}_{1},x),\mathit{distance}({M}_{1},{M}_{2},\mathit{short}),& \\ {M}_{1}\neq {M}_{2}& \Rightarrow \mathit{is}\_a({M}_{2},y) &\end{array}$$
(5.5)

where the variable T denotes a sequence, while the variables M_1 and M_2 denote two distinct occurrences of single motifs (M_1 ≠ M_2) of types x and y, respectively. With reference to the sequence described in (5.2), T corresponds to t_2, while the two distinct occurrences of single motifs M_1 and M_2 correspond to m_1 and m_2, respectively. By means of this association rule, it is possible to infer which single motif follows a single motif x at a short distance. The uncertainty of the inference is quantified by the confidence of the association rule.

Details on the association rule discovery algorithm implemented in SPADA are reported in the next section.

5.3 SPADA: Pattern Space and Search Procedure

In SPADA, the set O of spatial objects is partitioned into a set S of reference (or target) objects and m sets R_k, 1 ≤ k ≤ m, of task-relevant (or non-target) objects. Reference objects are the main subject of analysis and contribute to the computation of the support of a pattern, while task-relevant objects are related to the reference objects and contribute to accounting for the variation, i.e., they can be involved in a pattern. In the sequence described in (5.2), the constant t_2 denotes a reference object, while the constants m_1, m_2 and m_3 denote three task-relevant objects. In this case, there is only one set R_1 of task-relevant objects.

SPADA is the only ILP system which addresses the task of relational frequent pattern discovery by dealing properly with concept hierarchies. Indeed, for each set R_k, a generalization hierarchy H_k is defined together with a function ψ_k, which maps objects in H_k into a set of granularity levels {1, …, L}. For instance, with reference to the sequence described in (5.2), it is possible to define a three-level hierarchy H_1 (Fig. 5.2), where the top level represents a generic single motif, the middle level represents the distinct single motifs in M and the lowest level represents specific occurrences of motifs. In this example, the function ψ_1 simply maps the root to 1; x, y and z to 2; and m_1, m_2 and m_3 to 3.

Fig. 5.2
figure 2

A three-level hierarchy defined on motifs

The set of predicates used in SPADA can be categorized into four classes. The key predicate identifies the reference objects in S (e.g., sequence is the key predicate in description (5.2)). The property predicates are binary predicates which define the value taken by an attribute of an object (e.g., the length of a motif, not reported in description (5.2)). The structural predicates are predicates which relate task-relevant objects (e.g., distance) as well as reference objects with task-relevant objects (e.g., part_of). The is_a predicate is a binary taxonomic predicate which associates a task-relevant object with a value of some H_k.

The units of analysis D[s], one for each reference object s ∈ S, are subsets of ground facts in D_E, defined as follows:

$$D[s] = \mathit{is}\_a(R(s)) \cup D[s\vert R(s)] \cup {\bigcup \nolimits }_{{r}_{i}\in R(s)}D[{r}_{i}\vert R(s)],$$
(5.6)

where:

  • R(s) is the set of task-relevant objects directly or indirectly related to s;

  • is_a(R(s)) is the set of is_a atoms specified for each r_i ∈ R(s);

  • D[s | R(s)] contains both the properties of s and the relations between s and some r_i ∈ R(s);

  • D[r_i | R(s)] contains both the properties of r_i and the relations between r_i and some r_j ∈ R(s).

This notion of unit of analysis is coherent with the individual-centered representation [7], which has some nice properties, both theoretical (e.g., PAC-learnability [49]) and computational (e.g., a smaller hypothesis space and a more efficient search). The set of units of analysis is a partitioning of D_E into a number of subsets D[s], each of which includes the ground atoms concerning the task-relevant objects (transitively) related to the reference object s. With reference to the sequence described in (5.2), R(t_2) = {m_1, m_2, m_3}, and D[t_2] coincides with the whole set of ground atoms, including those inferred by means of the rules in the intensional part D_I of the deductive database. If several reference objects had been reported in (5.2), D[t_2] would have been a proper subset.

Patterns discovered by SPADA are conjunctions of Datalog non-ground atoms, which can be expressed by means of a set notation. For this reason they are also called atomsets [10], by analogy with the itemsets introduced for classical association rules. A formal definition of atomset is reported in the following.

Definition 5.1.

An atomset P is a set of atoms p_0(t_0^1), p_1(t_1^1, t_1^2), p_2(t_2^1, t_2^2), …, p_r(t_r^1, t_r^2), where p_0 is the key predicate, while p_i, i = 1, …, r, is either a structural predicate, a property predicate or an is_a predicate.

The terms t_i^j are either constants, which correspond to values of property predicates, or variables, which identify objects either in S or in some R_k. Each p_i is a predicate occurring either in D_E (extensionally defined predicate) or in D_I (intensionally defined predicate). Some examples of atomsets are the following:

$$\begin{array}{rcl}{ P}_{1}& \equiv & \mathit{sequence}(T),\mathit{part}\_of(T,{M}_{1}),\mathit{is}\_a({M}_{1},x) \\ {P}_{2}& \equiv & \mathit{sequence}(T),\mathit{part}\_of(T,{M}_{1}),\mathit{is}\_a({M}_{1},x),\mathit{distance}({M}_{1},{M}_{2},\mathit{short}) \\ {P}_{3}& \equiv & \mathit{sequence}(T),\mathit{part}\_of(T,{M}_{1}),\mathit{is}\_a({M}_{1},x),\mathit{distance}({M}_{1},{M}_{2},\mathit{short}), \\ & & \mathit{is}\_a({M}_{2},y) \\ \end{array}$$

where the variable T denotes a reference object, while the variables M_1 and M_2 denote some task-relevant objects. All variables are implicitly existentially quantified.

Atomsets in the search space explored by SPADA satisfy the linkedness [16] property, which means that each variable denoting a task-relevant object in an atomset P defined as in Definition 5.1 must be transitively linked to the reference object t_0^1 by means of structural predicates. For instance, the variables M_1 and M_2 in P_1, P_2 and P_3 are transitively linked to T by means of the structural predicates distance and part_of. Therefore, P_1, P_2 and P_3 satisfy the linkedness property.

Each atomset P is associated with a granularity level l. This means that all taxonomic (is_a) atoms in P refer to task-relevant objects which are mapped by some ψ_k into the same granularity level l. For instance, the atomsets P_1, P_2 and P_3 are associated with granularity level 2 according to the hierarchy H_1 in Fig. 5.2 and the associated function ψ_1. For the same reason, the following atomset:

$${P}_{4} \equiv \mathit{sequence}(T),\mathit{part}\_of(T,{M}_{1}),\mathit{is}\_a({M}_{1},\mathit{motif })$$

is associated with the granularity level 1.

In multi-level association rule mining, it is possible to define an ancestor relation between two atomsets P and P′ at different granularity levels.

Definition 5.2.

An atomset P at granularity level l is an ancestor of an atomset P′ at granularity level l′, l < l′, if P′ can be obtained from P by replacing each task-relevant object h ∈ H k at granularity level l (l = ψ k (h)) with a task-relevant object h′ which is more specific than h in H k and is mapped into the granularity level l′ (l′ = ψ k (h′)).

For instance, the atomset P 4 defined above is an ancestor of P 1, since P 1 can be obtained from P 4 by replacing motif with x.
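The ancestor relation can be sketched in a few lines of Python. This is an illustration only, not SPADA's implementation: atomsets are modeled as tuples of atoms, and the two-level taxonomy PARENT (level-2 terms mapped to their level-1 parent) and the function generalize are our own names.

```python
# Hypothetical two-level hierarchy: level-2 terms -> level-1 parent term.
PARENT = {"x": "motif", "y": "motif"}

def generalize(atomset):
    """Replace every taxonomy term in an is_a atom with its parent.

    The result is an ancestor of the input atomset (cf. Definition 5.2).
    """
    out = []
    for pred, *args in atomset:
        if pred == "is_a":
            var, term = args
            out.append(("is_a", var, PARENT.get(term, term)))
        else:
            out.append((pred, *args))
    return tuple(out)

P1 = (("sequence", "T"), ("part_of", "T", "M1"), ("is_a", "M1", "x"))
P4 = (("sequence", "T"), ("part_of", "T", "M1"), ("is_a", "M1", "motif"))

# P4 is an ancestor of P1: generalizing P1 yields P4.
assert generalize(P1) == P4
```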

By associating an atomset P with an existentially quantified conjunctive formula eqc(P), obtained by transforming P into a Datalog query, we can now provide a formal definition of the support of P on a deductive database D. We recall that D has an extensional part D E and an intensional part D I . Moreover, D E includes several units of analysis D[s], one for each reference object.

Definition 5.3.

An atomset P covers a unit of analysis D[s] if D[s] ∪ D I logically entails eqc(P) (D[s] ∪ D I ⊨ eqc(P)).

Each atomset P is associated with a support, denoted as sup(P,D), which is the percentage of units of analysis in D covered by P. The minimum support for frequent atomsets depends on the granularity level l of the task-relevant objects. It is denoted as σ min [l], and we assume that σ min [l+1] ≤ σ min [l], l = 1, 2, …, L−1.
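As a toy illustration of the support measure (not SPADA's actual entailment machinery), one can represent each unit of analysis as a set of ground atoms and replace the logical test D[s] ∪ D I ⊨ eqc(P) with a simple subset check on an already-ground pattern; the names support and covers are ours.

```python
def support(P, units, covers=lambda P, u: set(P) <= u):
    """Fraction of units of analysis covered by pattern P.

    `covers` is a trivial stand-in for the entailment test
    D[s] U D_I |= eqc(P); here: all atoms of P occur in the unit.
    """
    return sum(covers(P, u) for u in units) / len(units)

units = [
    {("sequence", "s1"), ("is_a", "m1", "x")},
    {("sequence", "s2"), ("is_a", "m2", "y")},
    {("sequence", "s3"), ("is_a", "m3", "x")},
]
P = [("is_a", "m1", "x")]
assert support(P, units) == 1 / 3  # only the first unit contains this atom
```

In the real setting the pattern contains variables and the coverage test is a Datalog query evaluation, so this subset check is only a conceptual placeholder.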

Definition 5.4.

An atomset P at granularity level l with support sup(P,D) is frequent if sup(P,D) ≥ σ min [l] and all ancestors of P are frequent at their corresponding levels.

In SPADA, the discovery of frequent atomsets is performed according to both an intra-level and an inter-level search. The intra-level search explores the space of patterns at the same level of granularity. It is based on the level-wise method [28], which performs a breadth-first search of the space, from the most general to the most specific patterns, and prunes portions of the search space which contain only infrequent patterns.

The application of the level-wise method requires a generality ordering, which is monotonic with respect to pattern support. The generality ordering adopted by SPADA is based on the notion of θ-subsumption [33].

Definition 5.5.

P 1 is more general than P 2 under θ-subsumption (P 1 ≽ θ P 2) if and only if P 1 θ-subsumes P 2, i.e., a substitution θ exists such that P 1 θ ⊆ P 2.

For instance, with reference to the atomsets P 1, P 2 and P 3 reported above, we observe that P 1 θ-subsumes P 2 (P 1 ≽ θ P 2) and P 2 θ-subsumes P 3 (P 2 ≽ θ P 3) with substitutions \({\theta }_{1} = {\theta }_{2} = \emptyset \).

The relation ≽ θ is a quasi-ordering (or preorder), since it is reflexive and transitive but not antisymmetric. Moreover, it is monotonic with respect to support [24], as stated in the following proposition.
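A minimal θ-subsumption test can be sketched as a backtracking matcher. This is an illustrative sketch under simplifying assumptions of our own (atoms are tuples, and identifiers starting with an uppercase letter are variables); it is not SPADA's code, and in general deciding θ-subsumption is NP-complete.

```python
def is_var(t):
    """Convention for this sketch: uppercase-initial strings are variables."""
    return isinstance(t, str) and t[:1].isupper()

def subsumes(p, q, theta=None):
    """True iff a substitution theta exists with p*theta a subset of q."""
    theta = theta or {}
    if not p:
        return True
    atom, rest = p[0], p[1:]
    for cand in q:                      # try to match the first atom of p
        if cand[0] != atom[0] or len(cand) != len(atom):
            continue
        t = dict(theta)
        ok = True
        for a, b in zip(atom[1:], cand[1:]):
            a = t.get(a, a)             # apply the substitution built so far
            if is_var(a):
                t[a] = b                # bind the variable
            elif a != b:
                ok = False              # constant clash: try next candidate
                break
        if ok and subsumes(rest, q, t):
            return True
    return False

P1 = (("sequence", "T"), ("part_of", "T", "M1"), ("is_a", "M1", "x"))
P2 = P1 + (("distance", "M1", "M2", "short"),)
assert subsumes(P1, P2)       # P1 theta-subsumes P2 (theta maps vars to themselves)
assert not subsumes(P2, P1)   # the distance atom has no image in P1
```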

Proposition 5.1.

Let P 1 and P 2 be two atomsets at the same level l, defined as in Definition 5.1. If P 1 ≽ θ P 2, then sup(P 1,D) ≥ sup(P 2,D).

It is noteworthy that if P 1 ≽ θ P 2 and P 1 is not frequent (sup(P 1,D) < σ min [l]), then P 2 is not frequent either (sup(P 2,D) < σ min [l]). This monotonicity property of ≽ θ with respect to the support allows for pruning the search space without losing frequent atomsets.

In the inter-level search, atomsets discovered at level l are refined by descending the generalization hierarchies until task-relevant objects mapped at level l+1 are found. These are the only candidate atomsets considered for evaluation, since other candidates would not meet the necessary condition for atomsets to be frequent at level l+1 when σ min [l+1] ≤ σ min [l] (see Definition 5.4). This way, the search space at level l+1 is heavily pruned. Moreover, information on the units of analysis covered by atomsets at level l can be used to make the evaluation of the support of atomsets at level l+1 more efficient. Indeed, if a unit of analysis D[s] is not covered by a pattern P at granularity level l, then it will not be covered by any descendant of P at level l+1.

Once frequent atomsets have been generated at level l, it is possible to generate strong spatial association rules, i.e., rules whose confidence is higher than a threshold κ min [l]. In particular, each frequent atomset P at level l is partitioned into two atomsets A and C such that P = A ∪ C, and the confidence of the association rule A ⇒ C is computed. Different partitions of P generate different association rules. Those association rules with confidence lower than κ min [l] are filtered out.
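The rule-generation step can be sketched as follows. This is an illustration, not SPADA's code: atoms are abstracted to opaque labels, and the support function (here a toy dictionary of hypothetical counts) stands in for sup(·,D).

```python
from itertools import combinations

def rules(P, sup, kappa_min):
    """Yield (A, C, conf) for every non-trivial partition A ∪ C = P.

    confidence(A => C) = sup(A ∪ C) / sup(A); rules below kappa_min
    are discarded.
    """
    P = tuple(P)
    for r in range(1, len(P)):
        for A in combinations(P, r):
            C = tuple(a for a in P if a not in A)
            conf = sup(frozenset(P)) / sup(frozenset(A))
            if conf >= kappa_min:
                yield A, C, conf

# Hypothetical supports for the atomsets {a}, {b}, {a, b}:
counts = {
    frozenset({"a"}): 0.8,
    frozenset({"b"}): 0.6,
    frozenset({"a", "b"}): 0.5,
}
strong = list(rules(("a", "b"), counts.__getitem__, 0.8))
# a => b has confidence 0.5/0.8 and is filtered out;
# b => a has confidence 0.5/0.6 and survives.
```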

We conclude by observing that in real-world applications a large number of frequent atomsets and strong association rules can be generated, most of which are uninteresting. This is also true for the module discovery problem (e.g., constituent motifs with a large inter-motif distance). To prevent this, some pattern constraints can be expressed in a declarative form and then used to filter out uninteresting atomsets or spatial association rules [4].

5.4 Implementation

The development of a module discovery tool effectively usable by biologists demands the solution of several problems, both methodological and architectural. Methodological problems involve data pre-processing, namely the discretization of numerical data, and the automated selection of some critical parameters, such as the minimum support. Architectural problems concern the interface of the tool with the external world, either to acquire data and parameters or to communicate results. In this section, solutions to these problems are briefly reported.

5.4.1 Choosing the Minimum Support Threshold

Setting the minimum support threshold σ min is not a trivial problem for a biologist with no a priori knowledge about the structural and functional features of potential modules. For this reason, we follow the approach suggested in [23]: users are asked to choose an interval [min, max] for the number of association rules they want to examine, and a value for σ min is then derived automatically. Indeed, the number of association rules generated by SPADA depends on σ min according to some function ϕ, which is monotonically decreasing (Fig. 5.3a). Therefore, the selection of an interval [min, max] for the number of association rules corresponds to the selection of an interval [σ1, σ2] for the support, which includes the optimal value σ min .

Fig. 5.3

(a) Functional dependence of the number of spatial association rules on the minimum support threshold. (b) Histogram and minimum points

Algorithm 1: find_minsup (dichotomic search for the minimum support threshold)

Contrary to [23], where a linear search for the optimal value is proposed, we apply a dichotomic search for efficiency reasons. The formulation of the algorithm is recursive (see Algorithm 1). Initially, the procedure find_minsup is invoked on the support interval [0, 1] and SPADA is run with σ min = 0.5. If necessary, find_minsup is recursively invoked on either [0, 0.5] or [0.5, 1]. Since the convergence of the algorithm cannot be proven, we stop the search when the number of recursive invocations exceeds a maximum iteration threshold MAX_ITERS. A reasonable setting is MAX_ITERS = 5, since after five iterations the width of the interval [σ1, σ2] is relatively small (\(\frac{1} {{2}^{5}}\)).
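The dichotomic search can be sketched as follows; this is a reading of Algorithm 1 rather than its exact code, and count_rules is a hypothetical stand-in for a full SPADA run returning the number of rules mined at a given minimum support.

```python
MAX_ITERS = 5  # bound the recursion: convergence cannot be guaranteed

def find_minsup(count_rules, lo, hi, n_min, n_max, iters=0):
    """Binary search for a support threshold yielding between n_min and
    n_max rules, exploiting that the rule count decreases with support."""
    mid = (lo + hi) / 2
    n = count_rules(mid)
    if n_min <= n <= n_max or iters >= MAX_ITERS:
        return mid
    if n < n_min:
        # Too few rules: lower the support threshold (left half).
        return find_minsup(count_rules, lo, mid, n_min, n_max, iters + 1)
    # Too many rules: raise the support threshold (right half).
    return find_minsup(count_rules, mid, hi, n_min, n_max, iters + 1)

# Toy monotonically decreasing rule-count function phi(sigma):
phi = lambda s: int(200 * (1 - s) ** 2)
sigma = find_minsup(phi, 0.0, 1.0, 50, 100)
assert 50 <= phi(sigma) <= 100
```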

5.4.2 Discretizing Numerical Values

SPADA cannot deal directly with numerical values of inter-motif distances. Therefore, it is necessary to transform them into categorical values through some discretization technique. The equal-frequency (EF) discretization algorithm partitions the initial range of values into a fixed number of intervals (or bins), such that the bins have different widths but contain approximately the same number of values. This partitioning may significantly affect the subsequent rule-mining step, but unfortunately, choosing a suitable number of bins is by no means an easy task for a biologist. For this reason, we investigated a new algorithm which, like EF, partitions the initial range according to the data distribution but, unlike EF, needs no input parameter.

Algorithm 2: DUDA (density-based unsupervised discretization)

This algorithm, called DUDA (Density-based Unsupervised Discretization Algorithm), is mainly inspired by clustering algorithms based on kernel density estimation [17], which group together data that follow the same (typically normal) distribution (see Algorithm 2). Histograms are used to model the distribution of the numerical data (inter-motif distances). The width w of each bin is computed by resorting to Scott's formula [40], \(w = \frac{3.5\times s} {\root{3}\of{n}}\), where n is the number of values to discretize and s is the standard deviation of the values.

In this work, we look for bins such that the values in each bin are normally distributed. Partitions are identified by finding relative minima in the histogram of the frequency distribution (Fig. 5.3b); these minima are candidate split points for the partitioning.
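The initialization step (Scott's bin width, histogram construction, and relative minima as candidate split points) can be sketched as follows; this is our own simplified reading of Algorithm 2, not its exact code.

```python
import math
import statistics

def scott_width(values):
    """Scott's formula: w = 3.5 * s / n^(1/3)."""
    s = statistics.stdev(values)
    return 3.5 * s / len(values) ** (1 / 3)

def histogram(values, w):
    """Frequency histogram with bins of width w over the data range."""
    lo = min(values)
    nbins = max(1, math.ceil((max(values) - lo) / w))
    freq = [0] * nbins
    for v in values:
        freq[min(int((v - lo) / w), nbins - 1)] += 1
    return freq

def split_points(freq):
    """Indices of relative minima in the histogram: candidate split points."""
    return [i for i in range(1, len(freq) - 1)
            if freq[i] <= freq[i - 1] and freq[i] <= freq[i + 1]
            and (freq[i] < freq[i - 1] or freq[i] < freq[i + 1])]

# Bins 1 and 3 are local minima of this toy histogram.
assert split_points([5, 1, 6, 2, 7]) == [1, 3]
```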

Once the initial partitioning is defined, the algorithm works iteratively: at each iteration, it tries to merge two consecutive bins. Merging is performed when the distribution of values in the partition obtained after merging fits a normal distribution better than the two original bins do. The decision to merge is based on the Kolmogorov–Smirnov normality test, which verifies the null hypothesis “H 0: data are normally distributed” at a given confidence level. In our case, we find the minimum confidence level α for which H 0 is not rejected, and we use it to identify the best merging according to the following formula:

$${L}_{a,b} = {\alpha }_{a,b} \cdot ({F}_{a} + {F}_{b}) - ({\alpha }_{a} \cdot {F}_{a} + {\alpha }_{b} \cdot {F}_{b}),$$
(5.7)

where:

  • F a (resp. F b ) is the relative frequency of values in the partition a (resp. b);

  • α a , α b and α a,b are the confidence values of the Kolmogorov–Smirnov test on a, on b and on the partition obtained by merging a and b, respectively;

  • L a,b is the loss obtained after merging a and b.

Obviously, the smaller L a,b , the better. The iteration stops when all possible L a,b are positive (no improvement is obtained) or no further merging is possible. The algorithm is recursive: it takes as input the list of partitions and the list of frequencies and returns a new list of partitions. The functions merge and mergeF take as input a list of r elements and return a list of r − 1 elements in which two consecutive elements are appropriately merged.
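The merging criterion of Eq. (5.7) can be sketched as follows. The alpha function below is our simplified stand-in for the Kolmogorov–Smirnov confidence: it computes the one-sample KS statistic against a fitted normal distribution and converts it to an asymptotic p-value, so it only approximates the test used by DUDA; the names alpha and loss are ours.

```python
import math
import statistics

def norm_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def alpha(values):
    """Approximate KS p-value for normality of `values` (simplified)."""
    xs = sorted(values)
    n = len(xs)
    mu, sigma = statistics.mean(xs), statistics.stdev(xs)
    # One-sample KS statistic against the fitted normal CDF.
    d = max(max((i + 1) / n - norm_cdf(x, mu, sigma),
                norm_cdf(x, mu, sigma) - i / n) for i, x in enumerate(xs))
    # Asymptotic Kolmogorov distribution with a small-sample correction.
    t = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * t * t)
                for k in range(1, 101))
    return max(0.0, min(1.0, p))

def loss(a, b, n_total):
    """L_{a,b} from Eq. (5.7): merge a and b only when the loss is negative."""
    fa, fb = len(a) / n_total, len(b) / n_total
    return alpha(a + b) * (fa + fb) - (alpha(a) * fa + alpha(b) * fb)
```

At each iteration one would compute loss for every pair of consecutive bins, merge the pair with the smallest negative loss, and stop when all losses are positive.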

5.4.3 Data Acquisition and Result Processing

SPADA has been integrated into a system which takes the set T of sequences from a text file. This file is processed in order to mine frequent motif sets, as presented in Sect. 5.2.1. The output of this first step is an XML file, which is stored in an XML repository. The corresponding document type definition (DTD) is shown in Fig. 5.4. For each frequent motif set S (|t S | ≥ τ min ), the XML file describes the support set t S together with some simple statistics (e.g., the ratio |t S | ∕ |T|). The module that implements the discretization algorithm DUDA (see Sect. 5.4.2) operates on data stored in the XML repository.

Fig. 5.4

Hierarchical arrangement of the elements of the XML document type definition

A wrapper of SPADA loads the XML data into the extensional part D E of the deductive database D used by SPADA itself, while the rules of the intensional part D I can be edited by the user through a graphical user interface. This wrapper is also in charge of automatically setting the σ min parameter, as per Algorithm 1 in Sect. 5.4.1.

By merging consecutive bins through the rules in D I , many spatial association rules are discovered which differ only in some intervals of inter-motif distances. An unmanageably large number of association rules makes the interpretation of results cumbersome for the biologist. For this reason, association rules are filtered before being shown to the user. Three filtering criteria are considered:

  • The first criterion selects the association rules with the smallest bins among rules with the same motifs, the same confidence and supported by the same sequences.

  • The second criterion selects the association rules with the greatest support among those with the same motifs and confidence, whose bins are included in the bins of the selected rules and whose list of supporting sequences is included in the list of supporting sequences of the selected rules.

  • The last criterion selects the association rules with the highest confidence among those with the same motifs, whose bins are included in the bins of the selected rules and whose list of supporting sequences is included in the list of supporting sequences of the selected rules.

5.5 Case Study

To show the potential of the integrated system, a pilot study is conducted on translation regulatory motifs located in the nucleotide sequences of untranslated regions (UTRs) of nuclear transcripts (mRNAs) targeting mitochondria. These motifs are essential for mRNA subcellular localization, stability and translation efficiency [50]. Evidence from recent studies supports the idea that the nature and distribution of these translation regulatory motifs may play an important role in the differential expression of mRNAs [11].

Datasets are generated as a view on three public biological databases, namely MitoRes, UTRef and UTRsite. The view integrates data on UTR sequences and the motifs they contain, together with information on the motifs' widths and their starting and ending positions along the UTR sequences in the UTRminer [48] database. We base our analysis on a set T of 728 3′-UTR sequences relative to the human species. Twelve motifs are initially considered (the set M). By setting τ min = 4, several frequent motif sets (the set \(\mathcal{S}\)) are extracted in the first phase. We focus our attention on the motif set \(S \in \mathcal{S}\) with the largest support set (111 3′-UTR sequences). It contains three motifs, which are denoted as x, y and z. The hierarchy defined on motifs has three levels (Fig. 5.2), but we consider only the middle level, since the top level conveys little information on the constituent motifs of a module, while the bottom level is too specific to find interesting rules.

To discretize inter-motif distances, both EF and DUDA discretization are tested with two settings of the threshold σ min (40% and 50%) and one of κ min (80%). The number of intervals set for EF is 12. Since we have no prior knowledge on the suitability of this choice, we intentionally define some distance predicates whose semantics correspond to a merging operation on consecutive intervals (rules (5.3) reported in Sect. 5.2.2 exemplify intensionally defined distance predicates for interval merging). This way, the comparison between EF and the discretization method proposed in this chapter is fair and does not depend on our initial decision to partition the distances into 12 intervals.

Experimentally, we observe that the running time varies significantly between the two solutions (Table 5.1). Indeed, the use of intensionally defined predicates to merge intervals slows down the discovery process and has the undesirable effect of returning a large number of similar rules which have to be finally filtered out.

Table 5.1 Results for the two discretization algorithms

We also test the procedure for the automated selection of the σ min threshold. The interval chosen for the number of spatial association rules is \([\mathit{min} = 50,\mathit{max} = 100]\), while MAX_ITERS = 6. After five steps, the system converges to σ min = 0.5313 and returns 85 spatial association rules (Table 5.2).

Table 5.2 Choosing the minimum support threshold

An example of spatial association rule discovered by SPADA is the following:

$$\begin{array}{rcl} & & \mathit{sequence}(T),\mathit{part}\_of(T,{M}_{1}),is\_a({M}_{1},x),\mathit{distance}({M}_{1},{M}_{2},[-99.. - 18]), \\ & & \ \ \ \ \mathit{is}\_a({M}_{2},y),\mathit{distance}({M}_{2},{M}_{3},[-99..3.5]),{M}_{1}\,\neq \,{M}_{2},{M}_{1}\,\neq \,{M}_{3},{M}_{2}\,\neq \,{M}_{3} \\ & & \quad \Rightarrow \mathit{is}\_a({M}_{3},z) \end{array}$$
(5.8)

This rule can be interpreted as follows: if a motif of type x is followed by a motif of type y, their inter-motif distance falls in the interval [−99..−18], and the motif of type y is followed by another motif at an inter-motif distance which falls in the interval [−99..3.5], then that motif is of type z. The support of this rule is 63.96%, while the confidence is 100%. The high support reveals a statistically overrepresented module, which may be indicative of an associated biological function. This module can also be represented by the following chain:

$$\langle x,[-99.. - 18],y,[-99..3.5],z\rangle,$$

which is similar to that reported as (5.1) in Sect. 5.2.2, with the difference that here the intervals of inter-motif distances are reported. The high confidence means that when the conditions expressed in the antecedent hold, the type z of the third motif in the chain can be predicted with certainty. Therefore, the spatial association rule conveys additional inferential information with respect to the frequent pattern.

5.6 Conclusions

In this chapter, we describe a new approach to module discovery based on mining spatial association rules from a set of biological sequences in which the type and position of regulatory single motifs are annotated. The method is based on an ILP formulation which facilitates the representation of the biological sequences by means of sets of Datalog ground atoms, the specification of background knowledge by means of Datalog rules and the formulation of pattern constraints by means of a declarative formalism. Although the results of the method are easy to read for a data-mining expert, they are not intelligible for a biologist because of the use of first-order logic to represent spatial patterns. For this reason, the spatial association rule mining method has been implemented in a tool which effectively supports biologists in module discovery tasks by graphically rendering the mined association rules. The tool also supports biologists in other critical decisions, such as selecting the minimum support threshold. To face the hard discretization problem, which typically affects discrete approaches like the one described in this chapter, we have also implemented a new discretization method, which is inspired by kernel density estimation-based clustering and needs no input parameters.

The tool has been applied to a pilot study on translation regulatory motifs located in the untranslated regions of messenger RNAs targeting mitochondria. The application shows the potential of the approach and methods proposed in this chapter.