1 Introduction

In recent years, large collections of historical handwritten documents are being scanned into digital images, in order to make them available through web sites of libraries and archives all over the world. However, the wealth of information conveyed by the text captured in these images remains largely inaccessible. Transcribing such documents by paleography experts is usually very expensive. Consequently, to exploit and make profit of such mass-digitization efforts, affordable information retrieval methods are required which allow the users to accurately and efficiently search for textual contents in large collections of untranscribed handwritten text images. This is one of the goals of projects such as HIMANISFootnote 1 [3] and READ,Footnote 2 where probabilistic indexing methods based on line-orientedword-segmentation-freeKeyword Spotting (KWS) are being developed [27, 29]. These methods rely on the same models used in handwritten text recognition (HTR), such as recurrent neural networks (RNNs) [7, 11, 24] or hidden Markov models (HMM) [2, 26, 30] for optical modeling, and N-grams for language modeling. Using these models, probabilistic word indices are built, assuming the finest search unit is the line image; that is, whole line images are analyzed to determine the degree of confidence that each given keyword appears in the image.

However, for searching in large collections involving millions of page images, line-level indexing can be less than adequate. The storage space required for the fine-grained line-level indices might become prohibitive and, on the other hand, a coarser, page-level search can be more than enough in most applications. Moreover, aiming at practical applications involving the search of general information in large image collections, we consider queries consisting in Boolean combinations of multiple words.

Boolean multi-word search can be implemented using any of the single-word KWS systems cited above. First, each word of the query is spotted separately, obtaining a set of spots (that is, lines or regions) in which each word is likely to appear above the specified confidence threshold. Then, set union, intersection and complement operations are applied to the resulting single-word spot sets to obtain the resulting set of spots of the given Boolean combined query. Yet, this still needs proper ways to combine the single-word confidence scores into the overall score of the Boolean query and check whether the combined score is higher than the given threshold.

An example of this approach is [21], which presents a (segmentation-based) KWS approach for multi-word queries formulated only with AND/OR Boolean operations. However, this approach has two main drawbacks: First, it requires a (perfect in the experiments of [21]) segmentation of all the images into individual words, which is obviously not affordable in practice for large image collections. And second, the implementation of the AND operation is inconsistent in probabilistic terms.

Clearly, only if the spotting scores are well normalized and probabilistically sound, we can follow standard probability laws to study how to consistently and adequately combine these scores. This is the idea we follow in all our works on KWS. Here, we extend the line-level indexing approach described in [27, 29] to build probabilistic word indices at the page level. In addition, we explore the feasibility of Boolean combination of single-word queries by introducing heuristic, albeit probabilistically consistent confidence score combination rules. Empirical results for page-level AND and OR word-pair Boolean queries are reported which support the consistency and usefulness of the proposed approach. This paper complements the work presented in [18] by reporting a new, larger empirical study aimed to measure the precision-recall performance of the different types of multi-word queries in a comparable way. It also includes a description of a real handwritten information retrieval system implemented using the proposed methods.

The rest of the paper is organized as follows. Section 2 overviews the probabilistic framework of single-keyword indexing and Sect. 3 introduces the probabilistic spotting scores proposed to support multi-word queries with Boolean operators. Dataset, evaluation measures, query selection and experimental set-up are presented in Sect. 4, and the empirical results are reported in Sect. 5. Section 6 outlines a demonstration system built following the proposed approach. Finally, Sect. 7 summarizes the work presented, draws conclusions and outlines future work.

2 Single-word probabilistic indexing

The indexing approach proposed in this work follows the KWS ideas originally presented in [27]. Here, a probabilistic word index is built at the page level. Let a page image, \({\mathbf {x}}\), be represented by their L text line images, \({\mathbf {x}}_1,\ldots ,{\mathbf {x}}_L\). In turn, let each text line \({\mathbf {x}}_l\) be described as a “frame sequence”: \({\mathbf {x}}_l=x_{l1},x_{l2},\ldots ,x_{lJ_l}\), A frame is a subimage of \({\mathbf {x}}_l\) composed of some (or maybe one) contiguous line image columns, or a feature vector extracted from such subimage (typically used with HMMs [23]). For each query word v and each page image \({\mathbf {x}}\), a score \(S({\mathbf {x}}, v)\) is obtained which measures how likely is the event “keyword v is written in \({\mathbf {x}}\)”, or re-phrased as “page image \({\mathbf {x}}\) is relevant for keyword v”. This score is computed as:Footnote 3

$$\begin{aligned} S({\mathbf {x}}, v)~{\mathop{=}\limits^{{\mathrm{def}}}}\,\max _{1\le l\le L}\,\max _{l\le j\le J_l} P(v\mid {\mathbf {x}},l,j) \end{aligned}$$
(1)

where \(P(v\mid {\mathbf {x}},l,j)\), called line-level frame word posterior, is the probability that the word v is present in the page image \({\mathbf {x}}\) at line l and frame position j.

As shown in [27], the line-level frame word posteriors required for Eq. (1) can be accurately and efficiently computed for each word v in a given lexicon or vocabulary V, using the same kind of optical, lexical and language statistical models as those used in HTR. In most previous works, N-grams and HMMs/RNNs have been used for language and character optical modeling, respectively. These models are trained from moderate amounts of training images accompanied by the corresponding transcripts using well known statistical estimation techniques [11, 12]. The lexicon, V, on the other hand, is also obtained from the training transcripts and possibly expanded with additional words obtained from other relevant texts, if they are available. Using these models, \(P(v\mid {\mathbf {x}},l,j)\) is computed for each l using a word-lattice, which is in turn obtained through an extension of the conventional process used to decode \({\mathbf {x}}_l\) into its best transcript [27].

Since \(P(v\mid {\mathbf {x}},l,j)\) is a well-defined discrete probability function, the score \(S({\mathbf {x}},v)\) given in Eq. (1) can be properly used to define the following Bernoulli distribution:

$$\begin{aligned} P({R}\mid {\mathbf {x}}, v) \,{\mathop {=}\limits^{{\mathrm{def}}}}\,{\left\{ \begin{array}{ll} S({\mathbf {x}}, v) & {R}= 1\\ 1 - S({\mathbf {x}}, v) & {R}= 0 \end{array}\right. } \end{aligned}$$
(2)

where the random variable \({R}\) represents the event “page image \({\mathbf {x}}\) is relevant for keyword v”. In order to explicitly assume this probabilistic meaning of \(S({\mathbf {x}},v)\), from now on we will refer to it as \(P({R}=1\mid {\mathbf {x}},v)\), or simply \(P({R}\mid {\mathbf {x}},v)\).

To produce the probabilistically index of a page image \({\mathbf {x}}\), \(P({R}\mid {\mathbf {x}},v)\) is computed for all \(v\in V\) and non-negligible values are retained. This (moderately intensive [27], but off-line) computation is carried out for all the images of the collection to be indexed. The resulting values are stored into an adequate database or data structure, \(\mathscr {D}\), along with geometrical information about the location and size of v within \({\mathbf {x}}\). Then for a given (single-keyword) query w, \(\mathscr {D}\) is searched for those entries \({\mathbf {x}}\) such that \(P({R}\mid {\mathbf {x}},w)>\tau \), where \(\tau \) is a threshold more or less explicitly specified by the user along with w itself. The off-line indexing phase avoids heavy computations during user’s query look-up and permits extremely fast query processing.

3 Multi-keyword spotting

To simplify notation, in this section \(P({R}\mid {\mathbf {x}},v)\) will be just denoted as \(P({R}_v\mid {\mathbf {x}})\). Moreover, we restrict the discussion to a fixed page image \({\mathbf {x}}\), so it can be dropped from the formulation. This way, \(P({R}\mid {\mathbf {x}},v)\) becomes just \(P({R}_v)\).

We are interested in queries that are Boolean combinations of several keywords, \(v_1,...,v_M\), using the three basic Boolean operators: OR, AND and NOT, respectively denoted as “\(\vee \)”, “\(\wedge \)” and “\(\lnot \)”. The relevance of \({\mathbf {x}}\) for an m-fold AND query is then written as \({R}_{v_1}\wedge {R}_{v_2}\dots \wedge {R}_{v_M}\), or just \({R}_1\wedge {R}_2\dots \wedge {R}_M\), for the sake of further simplifying notation. Similarly, the event for an OR query is denoted as \({R}_1\vee {R}_2\dots \vee {R}_M\).

Computing the probability of events associated with arbitrarily complex combinations of these Boolean operators can become very complex and, moreover, even for the simplest cases, the probabilities of conditional dependencies (which can hardly be ignored) are needed. Therefore, in this paper, we propose convenient, efficiently computable approximations, based on the early work of Boole and Fréchet [5, 9, 10], and we assess their suitability for multi-word KWS through empirical tests presented in Sects. 4 and 5. These approximations are:

$$\begin{aligned}&P({R}_1\wedge {R}_2\cdots \wedge {R}_M) \,\approx \, \min (P({R}_1), P({R}_2),\dots , P({R}_M)) \end{aligned}$$
(3)
$$\begin{aligned}&P({R}_1\vee {R}_2\cdots \vee {R}_M) \,\approx \, \max (P(R_1), P({R}_2),\dots , P({R}_M)) \end{aligned}$$
(4)

In addition, the relevance probability of the NOT operator applied to a Boolean query combination, B, is computed as:

$$\begin{aligned} P(\lnot B) = 1 - P(B) \end{aligned}$$
(5)

Using these equations, the (approximate) relevance probability of any arbitrary Boolean combination of single-keyword queries can be easily and very efficiently computed. For example, to search for image regions containing both the words “cat” and “dog” but none of the words “mouse” or “rabbit” the relevance probability will be computed as:

$$\begin{aligned}&P({R}_1\wedge {R}_2\wedge \lnot ({R}_3\vee {R}_4))\\&\quad \approx \min (P({R}_1),P({R}_2), (1-\max (P({R}_3),P({R}_4)))) \end{aligned}$$

where the events \({R}_1,{R}_2,{R}_3\) and \({R}_4\) correspond to the keywords “cat”, “dog”, “mouse” and “rabbit”, respectively.

4 Experiments

To assess the effectiveness of the proposed multi-word query spotting approach, several experiments were carried out. The dataset, the evaluation measures and the experimental set-up are presented in this section.

4.1 Dataset

The whole set contains more than 80,000 images of manuscripts written by the renowned English philosopher and reformer Jeremy Bentham (1748–1832) and his secretarial staff [6]. Page images of the Bentham collection (see examples in Fig. 1) generally require non-trivial preprocessing and layout analysis to deal with noisy and/or faint writing, marginal notes, stamps, skewed images, lines with different slope in the same page, variable slant, inter-line text, etc.

Fig. 1
figure 1

Examples of Bentham page images

From the Bentham data currently available, a dataset of 433 page images is used in this work. This dataset contains nearly 100, 000 running words from a vocabulary of more than 9000 different words. Table 1 summarizes the basic statistics of this dataset.

Table 1 Basic statistics of the Bentham dataset used in this work

As discussedt in Sect. 2, our indexing models need to be trained from data. Therefore, the dataset was divided into three subsets for training, validation and test, respectively encompassing 350, 50 and 33 images. Since it was not possible to accurately identify the writers in all cases, the pages were shuffled before distributing them over these three subsets. This means that some writers can appear both in the training and in the test sets. For each page image, text line regions were automatically obtained and manually revised. Note that a non-negligible number of these regions are short lines; for example, \(9.5\%\) of them contain just one word.

This dataset is exactly the same used in the ICFHR’14 handwritten text recognition competition [1], which is also part of the dataset used in the ICFHR’14 KWS competition [19]. On the other hand, it was employed as the training set of the ICDAR’15 KWS competition [20], with a test set of 70 pages. This test set was larger than the one used in this work, but the query set (243 single words) was much smaller (see Sect. 4.2). Finally, the size of the dataset used here is comparable to that of other standard datasets used for KWS benchmarking: George Washington [15] (20 pages), IAMDB [17] (1539 small form images), Parzival [8] (45 pages), etc.

4.2 Query selection

As commented in Sect. 1, the empirical study presented in this paper explores the performance of handwritten text KWS for queries composed of one or two keywords, combined using the two Boolean operators AND and OR. Both for single and word-pair queries, the individual words were selected from a subset of training words.Footnote 4 This subset, referred to as S, was composed of 3293 words whose frequency of occurrence in the training partition ranges from 2 to 10. This avoids including most stop words (generally with word frequencies greater than 10) and also many (singleton) words that are unlikely to appear in the test partition. In order to test word-pair AND/OR queries, a set, S2, of all the 5, 420, 278 pairs of different words in S was also generated.

Despite the selection criteria adopted, only a relatively small subset of 674 words from S does appear in the test images. The corresponding single-word queries are called “pertinent”, while all the other queries are called “non-pertinent”. Similarly, not all the word-pairs in S2 are pertinent for AND and OR query types. The total (maximum) number of pertinent queries which can be composed for each query type are reported in Table 2, along with the other figures mentioned above. The table also shows the corresponding maximum ratio, \(r_{{\mathrm {max}}}\), of non-pertinent with respect to pertinent queries.

Table 2 Basic statistics of the SINGLE, AND and OR pools of queries generated

For the experiments carried out in this work, both S and S2 were adequately sampled in order to produce query sets with increasing ratios, r, of non-pertinent with respect to pertinent queries. To produce a query set with a given ratio r, first all the pertinent queries of the type considered were included in the set and then the remaining queries available for this type were randomly sampled one by one without replacement until the ratio r was reached. Following this procedure, 14, 12 and 15 query sets with increasing r (ranging from 0 to 32, c.f. Sect. 5) were generated for SINGLE, AND and OR query types, respectively. The ranges of sizes of these sets were: \(674{-}3293\), \(11{,}784{-}925{,}880\), and \(992{,}007{-}5{,}395{,}078\), for the SINGLE, AND and OR query types, respectively.

In page-level KWS experiments, in addition to the number of queries, the total number of query events, that is, the number of pairs composed of an image and a query, is also informative. A query event is pertinent if the page image is relevant for the query (i.e., the query is actually written in the image). Table 2 also shows the event-level information for the different query types. It is worth noting that the maximum proportions, \(r_{{\mathrm {max}}}\), of non-pertinent with respect to pertinent query events are much larger in this case than when measured just in terms of plain queries.

Clearly, spotting non-pertinent queries is challenging, since the system may erroneously find other similar queries, which may lead to important precision degradation. Overall, the selected queries constitute rather challenging sets.

4.3 KWS evaluation measures

To assess KWS effectiveness, we employed the standard recall and interpolated precision measures, which are functions of a threshold used to decide whether a relevance probability \(P({R}\mid {\mathbf {x}},v)\) (see Eq. (2)) is high enough to assume that a word v is in the page \(\mathbf {x}\). Interpolated precision is widely used to avoid cases in which plain precision can be ill-defined [16]. Moreover, the popular scalar measure called average precision (AP) [22, 33] and the so-called R-precision (RP) are also used. The AP is defined as the area under the Recall-Precision curve. On the other hand, the most simple RP measure is defined as the precision (or recall) for some not null threshold such that recall is equal to interpolated precision. In addition, the maximum value of the harmonic mean of precision and recall, called \(F_1\)-measure, is used also to assess the overall behavior of a search and retrieval system.

4.4 System set-up

In order to build the page-level index, transcribed line images of the training partition were used to train both the optical, and language models.

In this work, hidden Markov models (HMMs) are used for optical modeling. 86 left-to-right character HMMs were trained from the line images, represented as 24-dimensional feature vector sequences computed according to [14]. HMM training consisted in 20 iterations of the Embedded Baum-Welch algorithm [12, 32], followed by 10 iterations of Lattice-Based Extended Baum-Welch Discriminative Training, as described in [31]. Likewise, for better lexicon and language model training, an improved text tokenization was applied which rules white-space among words, punctuation marks and digits (see Table 1 and [25]). A 2-gram word language model was trained using the Kneser-Ney back-off smoothing technique [13]. Meta-parameters associated with 2-gram and HMM training (grammar scale factor, word insertion penalty, number of states per HMM and number of Gaussians per state) were tuned using the validation partition. See [25] for more details about these settings.

Finally, using the previously trained models, page-level posterior probabilities of single-word queries, \(P({R}\mid {\mathbf {x}},v)\), were obtained as in Eq. (2) (see Sect. 2), as well as the corresponding probabilities for AND and OR word-pair queries, according to Eqs. (34).

While HMM optical modeling is adopted in this work, it is worth noting that the proposed probabilistic KWS methods, both for single-word and multi-word Boolean queries, can easily be implemented on top of any kind of character-level optical modeling approach. In future work, we plan to test the impact of better optical modeling using Convolutional/Recurrent Neural Networks, as in [3].

5 Results

As shown in Table 2, the maximum ratios (\(r_{{\mathrm {max}}}\)) between non-pertinent and pertinent queries are all larger than 1. This ratio is specially large for AND queries. In the experiments presented in [18], the whole query sets of Table 2 were used to measure and compare KWS performance for the different query types. This was notoriously unfair for AND queries, because the vast majority of them (99.78%) were non-pertinent. In [18], this led to significantly lower KWS performance for AND queries which in turn misled to the conclusion that AND queries were somehow more “difficult” than SINGLE and OR queries.

As mentioned in Sect. 1, in this work, we present new empirical results using adequately balanced query sets which allow us to fairly compare the KWS performance of the different query types. To this end, query sets of increasing ratio, r, of non-pertinent queries were generated for the each type of query, as explained in Sect. 4.2. This ratio was varied in a wide range in order to accurately measure the negative impact on KWS performance of including increasing amounts of non-pertinent queries. The results of this study are shown in Fig. 2, which plots the average precision (AP) as a function of r, for the three types of queries considered.

Fig. 2
figure 2

Average Precision (AP) as a function of the ratio between no-pertinent and pertinent queries (r)

For r lower than 0.5, the three query types achieve similarly good performance, with AP values greater than 0.92. For larger values of r, AP tends to degrade rather rapidly for all query types, but somewhat less for AND queries. It is important to understand that, from a practical point of view, a ratio such as \(r=1\) is already quite large: it would correspond to the unlikely use of an information retrieval system where every other query would be issued to hopelessly try to find information which can not actually be found in the indexed collection.

In Fig. 2 both the SINGLE and the OR curves appear to end prematurely, but this is just because r has reached the maximum possible values, \(r_{{\mathrm {max}}}\), for these types of queries in the relatively small test set used in the present experiments (see Table 2). In contrast, the AND curve can still go further down, since \(r_{{\mathrm {max}}}\) in this case is very much larger (459), due to the huge amount of non-pertinent AND queries which are possible from the word-pair set described in Table 2. Consequently, only for AND queries the degradation of AP when the amount of non-pertinent queries is aggressively increased can be studied. Results of this study are presented in Fig. 3. It shows Recall – Precision (R-P) curves and the corresponding AP values for two extreme ratios of non-pertinent queries, namely \(r=459\) (already reported in [18]) and \(r=0\), along with the corresponding results for an intermediate, albeit very large ratio, \(r=64\).

Fig. 3
figure 3

Recall-Precision curves and AP results for word-pair AND query sets with extreme values of r

It is fairly clear that most of the degradation is due to false positives produced when searching for non-pertinent word-pairs. Obviously, when trying to find a word-pair which does not actually exist in any of the test images, a perfect system should not produce any spot, unless the confidence threshold is set to 0. But a real system may spot, with non-negligible confidence, images containing words different but similar to those stated in the query. This typically tends to result in degradations of the spotting precision. It is thus gratifying to observe that, even in the most extreme case where the vast majority of queries are non-pertinent, the proposed multi-word KWS approach still provides a decent, usable precision-recall performance (AP = 0.78).

To finish this section, Table 3 reports overall KWS performance in terms of AP, R-precision (RP) and maximum \(F_1\)-measure (\(\hbox {F}_1^{*}\)) figures for low and moderate proportions of non-pertinent queries, \(r=0\) and \(r=1\). We can observe that the three query types behave very similarly, with good comparable performance for each r, and best performance obviously achieved in all the cases for \(r=0\).

Table 3 Average Precision (AP), R-Precision (RP) and maximum \(\hbox {F}_1\)-measure (\(\hbox {F}_1^{*}\)) for the three query sets considered and for r equal to 0 and 1

It should be finally remarked that the performance achieved in all the cases, even the most adverse ones, is very good, as compared with results reported in recent work on segmentation-free single-word KWS. Moreover, many of these works are based on query sets extracted from the test data which, as discussed throughout this paper, ensures that all the queries are pertinent (i.e., \(r=0\)), typically helping to increase the KWS performance. Although not fully comparable with the current work (see comments in Sect. 4.1), the best result obtained in the ICFHR’14 KWS competition with the Bentham dataset only achieved a mean Average Precision (mAP)Footnote 5 of 0.42 for query-by-example (QbE) KWS. This result was later improved in [28], where a QbE KWS mAP of 0.72 was achieved. The same paper also reported a mAP of 0.86 for QbS KWS, which is more directly comparable with the results of the present paper. Likewise, the winner of the ICDAR’15 KWS competition on the Bentham dataset, a system based on RNN, achieved a mAP of 0.87. Even though RNNs are the current state-of-the-art for HTR optical modeling, it is worth noting that the KWS performance obtained in the present work, based HMM optical models, is advantageously comparable with the best result of the ICDAR’15 KWS competition.

The high degree of usability of the results here presented can be witnessed first hand through real tests using the public demonstration system described in the following section.

6 Demonstration system

In order to provide a user-friendly interface that allows for public testing of the proposed multi-word KWS approach, a demonstratorFootnote 6 was implemented with the client-server architecture shown in Fig. 4. It is composed of 4 different modules: KWS Server, HTTP Server, Data Server and HTTP Client.

The KWS Server module provides single-word confidence scores by looking up an inverted index, which is hierarchically organized in several levels: collection, book and page. It also implements the page-image level Boolean multi-word query logic and probability computations proposed in this work, along with a basic parser which understands the query-string syntax. The HTTP Server module is responsible of honoring normal requests from web clients and dynamically builds responses using the data obtained from the KWS Server and Data Server; that is, word location coordinates and corresponding document information, along with the handwritten text images to be displayed by the client. The Data Server is a database which provides the required information about indexed documents: title, description, chapters information, pages information, page images and line bounding boxes, etc. Finally, the HTTP Client module implements the GUI. It is thus in charge of interacting with the users, allowing them to pose and edit the query strings, to send these query requests to the HTTP Server and to display the query results, namely the retrieved images and the bounding boxes of the spotted keywords.

Fig. 4
figure 4

KWS web demonstrator architecture

Some technical specifications of the implemented client-server architecture shown in Fig. 4, are listed below:

  • KWS Server

    • Implements a RESTful API using HTTP

    • KWS searches are resolved using the KWS index

    • Each keyword maps to a subindex of items with their confidence score.

  • HTTP Server

    • Resolves client requests and dynamically builds responses using PHP

    • Connects to the Data Server to obtain images, book titles, etc (SQL queries)

    • Connects to the KWS Server to bypass the KWS query sent by the user and process the results (HTTP REST petitions).

  • HTTP Client

    • Sends standard HTTP requests to the HTTP Server

    • Web pages built using HTML and Javascript on the client side

  • Data Server

    • Stores all the static information from books: title, description, chapters information, pages information, page images and lines bounding boxes.

Boolean operators are represented by the characters “&&” or blank, “||” and “–” for the operators AND, OR and NOT, respectively. In addition, parenthesis “(“ and “)” can be used to unambiguously group keywords and operators. Figures 5 and 6 show search results at the book and the page levels, respectively, corresponding to the query string “(jail || prison)&& (clean || easy)”.

Fig. 5
figure 5

KWS GUI displaying search results for the query “(jail || prison) (clean || easy)” at the book level

Fig. 6
figure 6

KWS GUI displaying search results for the query “(jail || prison) (clean || easy)” at the page level

7 Remarks, conclusion and future work

Following the line-level, single-keyword, probabilistic KWS approach introduced in [27, 29], in this paper we have presented simple but probabilistically consistent approximations to deal, at the page-image level, with queries consisting in Boolean combinations of single-keywords. We have also presented a study to evaluate the search performance of multi-keyword spotting based on these approximations.

The good results achieved support the interest of the proposed methods. Based on these methods a web-based demonstration system has been developed and details of this system are also presented in this paper.

A possible drawback of the KWS approach presented here is that it relies on a predefined lexicon, fixed in the training phase, and therefore, it does not support queries involving out-of-vocabulary keywords. To overcome this limitation, a KWS approach relying on character lattices rather than on word-lattices (see Sect. 2) can be used to compute the required line-level frame word posteriors for character strings which are likely to be real words. This idea has been very successfully used in [4] to index the iconic French Chancery Collection, containing 80,000 images of densely handwritten text in medieval French and Latin.

It is important to remark that the probabilistic multi-word spotting framework formulated in Sect. 3 can be straightforwardly applied without any change to lexicon-free probabilistic indices. In fact, the lexicon-free system developed in [4] does fully support multi-word Boolean AND / OR / NOT queries and can be tried online at http://prhlt-kws.prhlt.upv.es/himanis.

In future works, we plan to extend the empirical work by studying the performance achieved for queries entailing more than two keywords and more complex Boolean expressions, including a variety of combinations of OR, AND and NOT operations. In that study, we will also explore how the occurrence frequencies of training and testing words affect the search performance of multi-word queries. As commented in Sect. 4.4, one of our immediate plans is to test the impact of better optical modeling using Convolutional/Recurrent Neural Networks, as in [4], on the proposed probabilistic keyword search methods.