
1 Introduction

One of the main problems of information management in textual databases is the large amount of unstructured text, which is difficult to retrieve because of the way it is processed. Many retrieval systems perform only syntactic text processing, so much of the capability to identify the content of the retrieved text is lost.

Frequent ordered itemsets preserve the semantics of the text, since they keep related terms together and in order. This is achieved through the APO (Ordered APriori)-Structure [9], which represents the text through an intermediate form that facilitates its processing, capturing the content of the information and allowing greater precision and recall in the query results.

The WAPO-Structure introduces weights into the APO-Structure, so that the ordered sets can be visualized as a tag cloud with different font sizes. In this way, the tag cloud serves both as an assistant for query formulation and as a tool for exploring the contents of the database.

There are many methods to obtain the WAPO-Structure, such as any of the algorithms for obtaining frequent itemsets [1, 2, 11] with slight modifications. But if inverted indexes [4] are available, the processing time is considerably reduced, since repeated readings of the database to calculate support values are no longer necessary.

Hipp et al. [6] compare several of these algorithms in terms of performance for obtaining frequent itemsets, verifying that all behave similarly with respect to execution time, although they spend their time on different tasks. While some of the compared algorithms spend most of their time determining the support of candidate itemsets below level four, the Apriori algorithm has the greatest difficulty calculating support for itemsets of level four or higher.

In previous work [8, 9] we obtained the WAPO-Structure from a modification of the Apriori algorithm [1]. In this paper, we propose generating it from a complete inverted index, which becomes more advantageous as the level of the itemsets increases. With the Apriori algorithm, each time an item is added to an itemset a new reading of the database is required, whereas with inverted indexes the maximum-level itemsets are located in a single reading.

For a better understanding of the terminology used in this paper, we establish some preliminary concept definitions in Table 1.

Table 1. Concept definitions

This paper is organized as follows. Section 2 defines the APO-Structure and the WAPO-Structure. Section 3 gives the definition of the complete inverted index. Section 4 explains the method for obtaining the WAPO-Structure from a complete inverted index. Section 5 illustrates this process with a practical example and, finally, we end with a brief discussion and conclusions in Sect. 6.

2 APO-Structure and WAPO-Structure

The lack of structure in textual attributes complicates their automatic processing. In [8, 9] we show how the APO and WAPO Structures provide the mathematical structure needed to capture the semantics inherent to the text, facilitating its processing. These semantics are obtained by keeping frequently related terms together.

The complete process is carried out in five steps: selecting the textual attribute, syntactic preprocessing, semantic preprocessing, structure generation and display of the structure.

The WAPO-Structure adds weights to the APO-Structure.

2.1 APO-Structure [9]

Definition 1

AP-Seq (AP-Sequences)

Let \(X = \{x_1, x_2,\ldots ,x_n\}\) be a referential set of items and R a sequence of frequent itemsets. R is then an AP-Seq if and only if:

  1.
    $$\begin{aligned} \forall Z = (z_1,z_2,\ldots ,z_k) \in R \Rightarrow \left\{ \begin{array}{l} (z_1,z_2,\ldots ,z_{k-1})\in R \\ (z_2,z_3,\ldots ,z_k)\in R \\ \end{array} \right. \forall k \in [2,n] \end{aligned}$$
    (1)
  2.

    \(\exists ~Y \in R\) such that:

    $$\begin{aligned}&card(Y) = max_{Z\in R}~(card(Z))~\text {and}~\not \exists ~Y'\in R,~Y' \ne Y~|~card(Y') = card(Y) \end{aligned}$$
    (2)
    $$\begin{aligned}&\,\,\forall Z \in R \Longrightarrow Z\subseteq Y \end{aligned}$$
    (3)

The sequence Y with the highest order is the spanning sequence of R, \(R = g(Y)\); in other words, g(Y) is the AP-Seq with spanning sequence Y, the cardinal of Y being the level of g(Y).

Example 1

Let X = {intelligence, online, measure, test, partner}.

figure a

Then, R is an AP-Seq with spanning sequence \(Y={<}intelligence, test, partner{>}\). All the other sequences of R are included in it, preserving the relative order of the terms.
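To make Definition 1 operational, the following minimal Python sketch (the function name and representation are our own, not from [9]) generates the AP-Seq g(Y) from a spanning sequence by recursively adding the two contiguous subsequences required by condition (1):

```python
def ap_seq(spanning):
    """Generate the AP-Seq g(Y) from a spanning sequence Y.

    By condition (1) of Definition 1, every sequence of length k in R
    brings with it its length-(k-1) prefix and suffix, so g(Y) is the
    set of all contiguous ordered subsequences of Y.
    """
    seqs, stack = set(), [tuple(spanning)]
    while stack:
        z = stack.pop()
        if z in seqs:
            continue
        seqs.add(z)
        if len(z) >= 2:
            stack.append(z[:-1])  # (z_1, ..., z_{k-1})
            stack.append(z[1:])   # (z_2, ..., z_k)
    return seqs

# Example 1: R = g(<intelligence, test, partner>) yields the sequences
# <intelligence, test, partner>, <intelligence, test>, <test, partner>,
# <intelligence>, <test> and <partner>.
R = ap_seq(("intelligence", "test", "partner"))
```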

Definition 2

APO-Structure

Let \(X = \{ x_1,\ldots ,x_n \}\) be a referential set of items and \(S = \{A, B,\ldots \}\) a set of frequent item-seqs with a cardinal higher than or equal to one, and \(A,B,\ldots \) AP-Seqs such that:

$$\begin{aligned} \forall ~A,B \in S;~A \nsubseteq B, B \nsubseteq A~\text {and}~B \ne A \end{aligned}$$

An APO-Structure generated by S, \(E = g(A, B,\ldots )\), is the set of AP-Seqs whose spanning sequences are \(A, B,\ldots \)
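A direct reading of Definition 2 in code, reusing ap_seq from the previous sketch (again with our own naming, and assuming that inclusion means contiguous ordered inclusion, as in Example 1):

```python
def is_included(a, b):
    """True if item-seq a occurs in b as a contiguous ordered subsequence."""
    n = len(a)
    return any(tuple(b[i:i + n]) == tuple(a) for i in range(len(b) - n + 1))

def apo_structure(spanning_seqs):
    """Build the APO-Structure E = g(A, B, ...) of Definition 2.

    The generators must be pairwise incomparable: no spanning sequence
    may be included in another.  E is the union of their AP-Seqs.
    """
    seqs = [tuple(s) for s in spanning_seqs]
    for a in seqs:
        for b in seqs:
            if a != b and is_included(a, b):
                raise ValueError(f"{a} is included in {b}: invalid generator set")
    return set().union(*(ap_seq(s) for s in seqs))
```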

2.2 WAPO-Structure [9]

Definition 3

Frequent weighted item-seq of an AP-Seq

Let R = g(Y) be an AP-Seq with a referential set of items X. It is said that \(\widetilde{\alpha _t} \subseteq Y\) is a frequent weighted item-seq of Y if:

$$\begin{aligned} \widetilde{\alpha _t} = [\alpha _t,\omega _t]. \end{aligned}$$
(4)

where \(\alpha _t\) is a frequent term sequence and \(\omega _t\) is its weight or frequency.

WAPO-Structures are structures composed of weighted AP-Seqs, which are AP-Seqs composed of weighted item-seqs.

Definition 4

WAPO-Structure

Let \(X = \{ x_1, x_2,\ldots ,x_n\}\) be a referential set of items and \(\widetilde{S} = \{\widetilde{A},\widetilde{B},\ldots \}\) a set of frequent weighted item-seqs with a cardinal higher than or equal to one, and \(\widetilde{A},\widetilde{B},\ldots \) weighted AP-Seqs such that:

$$\begin{aligned} \forall ~A,B \in S;~A \nsubseteq B,~B \nsubseteq A~\text {and}~B \ne A. \end{aligned}$$
(5)

A WAPO-Structure generated by \(\widetilde{S}\), \(\widetilde{E} = g(\widetilde{A}, \widetilde{B},\ldots )\), is the set of weighted AP-Seqs whose spanning sequences are \(\widetilde{A}, \widetilde{B},\ldots \)

Note 1

The weighted AP-Seq with spanning sequence \(\widetilde{A}\) is also written \(\widetilde{g}(A)\).

Example 2

Let us suppose a database containing the tuples in Table 2.

Table 2. Tuples in the Example 2

Setting a minimum support of 2, in terms of absolute frequency, for an item-seq to be considered frequent, we obtain the following structures:

figure b

3 Complete Inverted Index

Inverted indexes are widely used in information retrieval [3, 7], as well as in other applications [5, 10]. In this work we use the definitions given in [4]; to understand them, some notations are established in Table 3.

Table 3. Notations

Note 2

If \(\omega = xyz\) for the terms \(x,y,z \in \Sigma ^* \Rightarrow \) y is a subterm of \(\omega \), x is a prefix of \(\omega \) and z is a suffix of \(\omega \).

Definition 5

Complete Inverted Index [4]

Given a finite alphabet \(\Sigma \), a set of terms \(k \subseteq \Sigma ^+\) and a set of texts \(S \subseteq \Sigma ^+\), a complete inverted index for \((\Sigma ,k,S)\) is an abstract data type that implements the following functions:

  1.

    find: \(\Sigma ^+ \rightarrow k \cup \{\lambda \}\), where find(\(\omega \)) is the largest prefix x of \(\omega \) with \(x \in k \cup \{\lambda \}\) and x occurs in S, that is, x is a subset of terms of a text in S.

  2.

    freq: \(k \rightarrow \mathbb {N}\), where freq(\(\omega \)) is the number of times that \(\omega \) occurs as a subset of terms of a text in S.

  3.

    locations: \(k \rightarrow 2^{\mathbb {N} \times \mathbb {N}}\), where locations(\(\omega \)) is the set of ordered pairs indicating the number of the text in which \(\omega \) occurs and its position within the text.
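As a concrete illustration of Definition 5, the sketch below implements the three functions in memory for a word-level alphabet; the class layout and the max_len bound are our own assumptions for the examples in this paper, not the data structure of [4]:

```python
from collections import defaultdict

class CompleteInvertedIndex:
    """Toy in-memory complete inverted index for a list of texts S."""

    def __init__(self, texts, max_len=5):
        # texts: iterable of whitespace-separated strings (the texts of S)
        self.texts = [t.split() for t in texts]
        self.locs = defaultdict(set)  # term sequence -> {(text_no, position)}
        for doc_id, words in enumerate(self.texts):
            for i in range(len(words)):
                for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
                    self.locs[tuple(words[i:j])].add((doc_id, i))

    def find(self, omega):
        """Largest prefix of omega occurring as a term sequence in S."""
        omega = tuple(omega)
        for k in range(len(omega), 0, -1):
            if omega[:k] in self.locs:
                return omega[:k]
        return ()  # the empty tuple plays the role of lambda

    def freq(self, omega):
        """Number of times omega occurs as a term sequence in S."""
        return len(self.locs.get(tuple(omega), ()))

    def locations(self, omega):
        """Set of (text number, position) pairs where omega occurs."""
        return self.locs.get(tuple(omega), set())
```

Indexing every contiguous term sequence up to max_len trades memory for the single-read property exploited in Sect. 4.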

Definition 6

Rule of S [4]

A rule of S (\(r_S\)) is a production \(x \rightarrow _s\gamma x \beta \), where \(x \in sub(S)\) and \(\gamma ,\beta \in \Sigma ^*\), that occurs each time that x is preceded by \(\gamma \) and followed by \(\beta \) in S.

Definition 7

Primary rule of S [4]

\(t_S: x \rightarrow _s\gamma x \beta \) is a primary rule of S if it is a rule of S and \(\gamma \) and \(\beta \) are sets of terms of the highest possible order, that is, \(\not \exists \ \delta , \tau \in \Sigma ^*\) with \(\delta , \tau \ne \lambda \) such that \(x \rightarrow _s\delta \gamma x \beta \tau \) is a rule of S.

Definition 8

Implication of x in S [4]

If \(x \rightarrow _s\gamma x \beta \) is a primary rule of S, then \(\gamma x \beta \) is called the implication of x in S and is denoted \(imp_S(x)\): \(P(S) = \{imp_S(x):x \in sub(S)\}\)

The members of P(S) are called subsets of primary terms of S.
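As Sect. 4 will note, the primary rules of a mono-term correspond to the tuples containing it. Under that reading, primary rules and implications can be enumerated directly from the locations function; this is a sketch built on the CompleteInvertedIndex class above, assuming the index was built with max_len at least the length of the longest tuple:

```python
def primary_rules(x, index):
    """Primary rules of S for a mono-term x (Definitions 6 and 7).

    At each occurrence of x, the maximal context gamma.x.beta is the
    whole tuple containing x (cf. Sect. 4); its frequency is obtained
    with freq, and gamma.x.beta is the implication imp_S(x) of Def. 8.
    """
    rules = {tuple(index.texts[doc_id])
             for doc_id, _pos in index.locations((x,))}
    return {rule: index.freq(rule) for rule in rules}
```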

4 Obtaining the WAPO-Structure Through Complete Inverted Indexes

It is possible to obtain the WAPO-Structure from the APO-Structure through inverted indexes in two main ways:

The first is through the Apriori algorithm, working in a similar way to the AprioriTid and AprioriHybrid algorithms [1], with a slight modification to introduce order and weight into the itemsets.

These algorithms iteratively construct the set of frequent itemsets, using the frequent itemsets found in one step to build the candidate itemsets and checking whether they are frequent in the next step.

In the first step, the support of elementary items, or items of level 1, is calculated, determining which of these items are considered frequent according to the minimum support. Each subsequent step starts with a “seed” set consisting of the itemsets found in the previous step, which are combined with each other to generate the candidate itemsets, deciding which of these are, in turn, frequent.

To do this, the Apriori algorithm requires re-reading the data at each step, but AprioriTid has the property that it is not necessary to go through the entire database to calculate the support of the candidate itemsets after the first step. For this purpose, a codification of the candidate itemsets found in the previous step is created before deciding whether they are frequent in the subsequent step. In successive steps, the size of this coding becomes much smaller than that of the database, saving a lot of reading effort.

This coding is the one we can obtain through the inverted index, later applying the Apriori algorithm; instead of going through the entire database at each step, only the inverted indexes are read, which indicate the ordered positions of each term in the text, discarding those that do not correspond to a frequent itemset for the next step.
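To illustrate this first way (a sketch in the spirit of AprioriTid [1] adapted to ordered itemsets, not the original implementation): with the positional postings from the inverted index, the support of a level-(k+1) ordered candidate is obtained by shifting and intersecting the postings of its two level-k subsequences, without rereading the database.

```python
def extend_candidates(frequent_k, min_support):
    """One Apriori-style step over positional postings.

    frequent_k maps each frequent ordered item-seq of level k to its
    postings {(text_no, start_position)}.  Joining <a_1..a_k> with
    <a_2..a_k, b> needs only a one-position shift of the postings.
    """
    frequent_next = {}
    for seq, post in frequent_k.items():
        for other, other_post in frequent_k.items():
            if seq[1:] != other[:-1]:
                continue  # the two sequences must overlap in k-1 ordered items
            cand = seq + other[-1:]
            # cand occurs at (d, i) iff seq does and other occurs at (d, i+1)
            cand_post = {(d, i) for (d, i) in post if (d, i + 1) in other_post}
            if len(cand_post) >= min_support:
                frequent_next[cand] = cand_post
    return frequent_next
```

Starting from the level-1 postings given by locations and iterating this step until it returns an empty set yields the maximal frequent ordered item-seqs.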

The second way is the one proposed in this article and consists of identifying the implications of \(x_i^1\) with the spanning sets of the APO-Structure, where \(x_i^1\) denotes each of the frequent itemsets of level 1. Obviously, only those implications that are, in turn, frequent must be identified.

To do this, the primary rules of \(x_i^1\) (the tuples containing the term in question) are obtained first and the maximal ones are selected, storing those that are frequent. The non-frequent ones are divided into sub-rules, deciding which of these are, in turn, frequent. Once the set of all the frequent rules is obtained, we eliminate the redundant and non-maximal ones, and the remaining rules are what we call the “frequent implications of \(x_i^1\)”, which are identified with the spanning sets of the APO-Structure.

Finally, we use the function freq to obtain the weight of the item-seqs in the APO-Structure and generate the WAPO-Structure.

Next, we define the set of frequent implications of x in S, where all the frequent implications of \(x_i^1\) are stored in order to identify them with the spanning sets of the APO-Structure.

Definition 9

Set of frequent implications of x in S

Let \(P(S)=\{imp_S(x):x \in sub(S)\}\) be the set of all the implications in S. We call Pf(S) the set of all the frequent implications in S; then \(Pf_{x_i^1}(S)=\{imp_S^j(x_i^1):x_i^1 \in sub(S)\}\), where \(x_i^1\) is a frequent itemset of level 1 and j indexes each of the frequent implications of the itemset i.

Definition 10

Correspondence between frequent implications and the spanning sets of the APO-Structure

Let \(E=g(A,B,\ldots ,K,\ldots )\) be the APO-Structure generated by the sets \(A, B,\ldots ,K,\ldots ,\) then:

\(A=imp_S^1(x_1^1), B=imp_S^2(x_1^1),\ldots ,K=imp_S^1(x_2^1),\ldots \) removing redundancies.

The following algorithm specifies the process in more detail; see also the sketch after the figure. For its application, it is necessary to have the complete inverted index for all the terms of the text base S.

figure c
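Since the algorithm figure is not reproduced here, the following sketch captures our reading of the steps described above (primary rules, frequency test, recursive splitting into sub-rules, elimination of non-maximal rules); it reuses primary_rules and is_included from the earlier sketches and is not a transcription of the original pseudocode:

```python
def frequent_implications(index, min_support):
    """Frequent implications of Sect. 4 from a complete inverted index.

    For each frequent mono-term x: start from its primary rules, keep
    the frequent ones, split each non-frequent rule into the two
    sub-rules obtained by dropping the outermost term on either side
    of x, and recurse.  Finally remove non-maximal implications.
    """
    implications = set()
    terms = {w for text in index.texts for w in text}
    for x in sorted(terms):
        if index.freq((x,)) < min_support:
            continue  # only level-1 frequent itemsets are expanded
        pending, seen = set(primary_rules(x, index)), set()
        while pending:
            rule = pending.pop()
            if rule in seen:
                continue
            seen.add(rule)
            if index.freq(rule) >= min_support:
                implications.add(rule)
            else:
                i = rule.index(x)
                if i > 0:                  # gamma non-empty: drop its first term
                    pending.add(rule[1:])
                if i < len(rule) - 1:      # beta non-empty: drop its last term
                    pending.add(rule[:-1])
    # keep only the maximal frequent rules (Definition 10, removing redundancies)
    return {r for r in implications
            if not any(r != s and is_included(r, s) for s in implications)}
```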

5 Example of How to Obtain APO-Structure and WAPO-Structure Through Implications

Let us suppose we obtain the item-seqs listed in Table 4 from a database after cleaning the text.

The function locations(\(\omega \)) for all mono-terms is presented in Table 5 and the values of the function freq(\(\omega \)) in Table 6.

Table 4. Item-seqs after text cleaning
Table 5. locations(\(\omega \))
Table 6. freq(\(\omega \))

Let us consider a minimum support, for an item-seq to be frequent, greater than or equal to an absolute frequency of 2. In the current example, the frequent item-seqs of cardinality 1 are the mono-terms \({<}pink{>}\), \({<}yellow{>}\), \({<}blue{>}\) and \({<}green{>}\).

We compute the implications for each of these frequent item-seqs of cardinality 1. To do this, it is first necessary to compute their rules. The rules for the item-seq \({<}pink{>}\), along with each rule's frequency, are shown in Table 7, where \(r_i\) represents rule i and \(freq_i\) its frequency.

Table 7. Rules for item-seq \({<}pink{>}\)
Table 8. Primary rules for item-seq \({<}pink{>}\)
Table 9. Subrules for (\(t_2\))
Table 10. Subrules for (\(r'_1\))

Then, the primary rules are identified: we select only the maximal rules. Rule \(r_4\) is not a primary rule, as it is contained in rule \(r_2\). Rule \(r_1\) is not a primary rule either, as it is contained in rule \(r_3\).

Table 8 shows the primary rules for the item-seq \({<}pink{>}\), where \(t_i\) represents primary rule i and \(freq_i\) its frequency. Of these primary rules, only \(t_1\) is frequent with respect to the support, so we store it in the set of frequent implications Pf(S).

Since \(t_2\) is not frequent, it is divided into the two sub-rules shown in Table 9, where \(r'_i\) represents sub-rule i and \(freq_i\) its frequency. In this case, \(r'_2\) is frequent and comes from a non-frequent primary rule, so \(r'_2\) becomes a frequent primary rule (since there is no frequent higher-order rule) and is stored in the set of frequent implications Pf(S).

Since \(r'_1\) is not frequent, it is divided into two sub-rules \(r''_i\) (see Table 10). Rule \(r''_2\) is frequent and is stored in Pf(S); however, it is not a primary rule, since there is another maximal rule in Pf(S) containing it, so it will be removed from this set. Rule \(r''_1\) is not frequent, so the operations of dividing into sub-rules, checking the frequencies and storing the frequent rules are repeated.

Finally, two frequent implications for \({<}pink{>}\) in Pf(S) have been obtained. They are shown in Table 11.

The same procedure is applied for the item-seqs \({<}yellow{>}\), \({<}blue{>}\) and \({<}green{>}\), obtaining their frequent implications. They are shown in Tables 12, 13 and 14.

Table 11. Frequent implications for \({<}pink{>}\)
Table 12. Frequent implications for \({<}yellow{>}\)
Table 13. Frequent implications for \({<}blue{>}\)
Table 14. Frequent implications for \({<}green{>}\)

In total, we have determined the following implications:

  • \(imp_S(x_1^1) = imp({<}pink{>})\Rightarrow imp^1(x_1^1)={<}pink, yellow, blue{>}\) and \(imp^2(x_1^1)={<}green, pink, yellow{>}\)

  • \(imp_S(x_2^1) = imp({<}yellow{>}) \Rightarrow imp^1(x_2^1)={<}pink, yellow, blue{>}\) and \(imp^2(x_2^1)={<}green, pink, yellow{>}\)

  • \(imp_S(x_3^1) = imp({<}blue{>}) \Rightarrow imp^1(x_3^1)={<}pink, yellow, blue{>}\)

  • \(imp_S(x_4^1) = imp({<}green{>}) \Rightarrow imp^1(x_4^1)={<}green, pink, yellow{>}\)

When duplicate implications are removed, two implications remain:

\(imp^1(x_1^1) = {<}pink, yellow, blue{>}\) and \(imp^2(x_1^1) = {<}green, pink, yellow{>}\)

These implications will be used as the maximal item-seqs of the APO-Structure, which therefore has cardinal 2.

Let E be an APO-Structure, with spanning sequences A and B:

\(E = g(A,B) \Rightarrow A=imp^1(x_1^1)~and~B=imp^2(x_1^1)\)

\(E= g({<}pink, yellow, blue{>}, {<}green, pink, yellow{>})\)

\(E= ({<}pink, yellow, blue{>}, {<}green, pink, yellow{>}, {<}pink, yellow{>}, {<}yellow, blue{>}, {<}green, pink{>}, {<}pink{>}, {<}yellow{>}, {<}blue{>}, {<}green{>})\)

In order to compute the weights for the WAPO-Structure, we use the function freq(\(\omega _i\)), where \(\omega _i\) ranges over the item-seqs of the APO-Structure. The resulting WAPO-Structure is the following:

\(\widetilde{E}=g(({<}pink, yellow, blue{>}, 2), ({<}green, pink, yellow{>}, 2))\)

\(\widetilde{E}= (({<}pink, yellow, blue{>}, 2), ({<}green, pink, yellow{>}, 2), ({<}pink, yellow{>}, 4), ({<}yellow, blue{>}, 2), ({<}green, pink{>}, 3), ({<}pink{>}, 5), ({<}yellow{>}, 5), ({<}blue{>}, 3), ({<}green{>}, 5))\)
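Numerically, this last step amounts to pairing each item-seq of E with its frequency; a usage sketch built on apo_structure above, with the freq values stated in this section (in practice they come from index.freq):

```python
# Spanning sequences obtained from the two frequent implications
A = ("pink", "yellow", "blue")
B = ("green", "pink", "yellow")
E = apo_structure([A, B])  # the nine item-seqs listed above

# freq values as reported in this section
freq = {("pink", "yellow", "blue"): 2, ("green", "pink", "yellow"): 2,
        ("pink", "yellow"): 4, ("yellow", "blue"): 2, ("green", "pink"): 3,
        ("pink",): 5, ("yellow",): 5, ("blue",): 3, ("green",): 5}

# WAPO-Structure: each item-seq of E paired with its weight
E_weighted = {seq: freq[seq] for seq in E}
```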

6 Conclusions

The inverted index helps us identify the primary rules of the frequent terms and the frequent sub-rules of the non-frequent primary rules. The set consisting of all frequent maximal rules is called “the set of frequent implications”. Each of the rules in this set corresponds to a spanning set of the APO-Structure, so we obtain the WAPO-Structure from these frequent implications and their frequencies.

The biggest drawback of this method is that, when there are many maximal non-frequent item-seqs, a lot of time is lost decomposing these sequences until the level at which they are frequent is found, in order to obtain the frequent implications. This drawback makes the method most appropriate for texts in which many large repeated sequences are found. Otherwise, it is preferable to use the Apriori algorithm, which according to Hipp et al. [6] has the greatest difficulty calculating support for itemsets of level four or higher and is best known for its ease and simplicity of implementation.

In short, the method proposed in this paper saves reading time by not having to go through the database repeatedly to get the frequent item-seqs, as the Apriori algorithm [1] does. Its application is recommended in databases where most of the frequent sequences are long, while the Apriori algorithm works better in other cases, depending on the characteristics of the text.