Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

One of the challenges in information retrieval (IR) is to provide a structured view on retrieved results, which allows to better navigate within document subsets and to more easily explore the analyzed domain. While linear visualizations of result sets, i.e. simple lists of results, are still commonly used, they do not convey information about the structure of result sets, e.g. clusters of similar results. In comparison, a well-organized representation can help users to better understand the current domain of their interest and discover relationships between the retrieved documents. For this goal, we decided to study possible applications of methods of exploratory data mining known as Formal Concept Analysis (FCA) [1]. Users will be supported by an interactive visualization of a hierarchical structure of document clusters - subsets of documents with shared attributes defined within the domain. This approach can be used to support a user during an exploratory search scenario [2, 3], where the user needs to get an overview of some topic or domain. Such tool can be useful in a companion technology to support interactive information access and organization.

FCA provides methods that transform input data tables into concept hierarchies known as concept lattices. It has been successfully applied in several areas like conceptual data analysis, data and text mining, or information retrieval. In comparison to ontology it provides a dynamic view of actual input data table for user-specified query. In its basic version (the crisp case), FCA processes input data tables in the form of object-attribute models, also known as formal contexts. They are represented by binary relations, i.e., an object has or does not have an attribute value. Of course, there are many examples where it is natural to use formal contexts with multi-valued attributes. Hence, extensions of FCA were proposed in order to process object-attribute models, where the relationship between objects and their attributes is based on fuzzy relations, cf. [4,5,6,7].

For the purposes of practical data mining analysis, certain fuzzy FCA models, known as one-sided concept lattices, were introduced. In this case, object clusters are considered as ordinary subsets (as in the crisp case), while attributes are assumed to obtain fuzzy values [8]. In order to work with input data tables with different types of attributes an approach called Generalized One-Sided Concept Lattice (GOSCL) was introduced in [9]. This model produces a hierarchical structure of concepts from heterogeneous object-attribute models in the form of a concept lattice, which can be used in several applications for the analysis of input data. Although outputs of FCA-based methods in the form of concept lattices provide valuable information about object-attribute models, in many cases the resulting structure is too complex and hence a reduction is needed. Several approaches were already introduced, cf. [10,11,12]. Also for GOSCL, some methods were already tested for the reduction of concept lattices based on different ranking methods using support, stability, confidence, or other metrics, cf. [13,14,15,16].

The main goal of this paper is to apply methods like GOSCL to result sets of web search queries in order to get a structured view of the searched domain. When analyzing documents acquired from web engines using FCA-based methods, we need to prepare input tables in the form of the object-attribute model. In this case, objects are documents from the query result set and attributes are descriptive elements or features of these documents, e.g., the existence of a word within the document snippet, their importance measured by the tf-idf score, metadata of the document, etc. Then, a concept lattice can be created and visualized to the user. However, showing a concept lattice to the user is not straightforward due to the complex nature of its graph-like structure. Our goal is to support a user in interactively exploring a search domain. We hypothesize that visualizing search results as a concept lattice would support users during the process of search and exploration. In our first experiments, we integrated FCA-based methods into the graph visualization tool known as the creative exploration toolkit (CET) [17]. The main idea and some first work-in-progress details were introduced in [18]. In order to extend our system, we decided to re-use the back end of CET and its search functionality and add new interactive visualizations based on selected Javascript libraries (D3Footnote 1 and gojsFootnote 2). In comparison to our previous work, we were able to extend the features of our visualizations and therefore gain a better understanding how users benefit from the visualized concept lattice of search results.

The remainder of this paper is organized as follows: Sect. 2 provides information on related work, introducing how FCA methods can be used in information retrieval tasks and extended to fuzzy FCA approaches. Section 3 describes necessary details regarding the theory of FCA, especially details regarding the GOSCL model and notes on the reduction of concept lattices. Section 4 is devoted to our proposed approach and designed visualizations. Section 5 describes our experiments with the system and discusses results. Finally, we draw conclusions and present ideas for future work.

2 Related Work and Motivation

FCA-based methods were already used in information retrieval systems and tasks. A survey on this topic can be found in [19]. Usually, approaches are divided into:

  • Direct usage for exploration: A concept lattice is created and directly used for search and navigation in the space of hierarchically organized clusters of documents acquired from the query result set.

  • Usage for reduction: The original concept lattice is first reduced (in order to simplify results for better and more understandable visualization) and then it is used for search and navigation.

  • Usage as a ranking function: Methods from FCA are used as an analytical function in order to support the process of linear (re)ordering of the results according to the structure of the concept lattice and its features.

Examples are the CREDO system [20] and its extension CreChainDo [21], which visualize a hierarchy of document clusters or concepts. User feedback on the relevance of concepts is used to adapt the hierarchy, i.e., it is an interactive and adaptive exploration tool for documents from query result sets. Concepts are presented and organized in a tree structure, so it is a simplified and reduced version, i.e., a subpart of the concept lattice, which can be built on such data. In its extension, it is able to support a chain of the interactive steps and manage them. Similarly, in [22] an approach is described which uses an FCA lattice for navigation and attributes for query tuning. In [23] an approach for the enhancement of keyword-based search has been proposed, where local FCA models with agglomerative clustering were used in order to obtain a linear ranking of documents based on a combination of full-text and conceptual search. More work on this topic can be found in the aforementioned survey.

The main disadvantage of these approaches is that only the classic framework for FCA is used, i.e., every description of objects and their sets is based on binary attributes. Our vision is to apply a one-sided fuzzy approach, where objects are considered in a binary way. Defined clusters still are subsets, i.e., the object is or is not part of a concept or cluster. However, attributes can obtain fuzzy values. In this case particular documents are described by the set of different types of attributes, i.e., not only binary attributes are used, but we can describe the documents within the query result set also by some nominal, ordinal, numeric, or general lattice attributes, in order to have a more natural description of them according to user without the possibility for specific preprocessing and transformation (called conceptual scaling) to a binary input data table (which can be less interpretable for users). Here fuzzy approaches, like our generalized one-sided concept lattice (an approach which is able to combine different types of attributes within one heterogeneous data table), can be very useful and their application to interactive exploration should be analyzed. As it was already mentioned, we provided some first work-in-progress version of the system in [18]. Here, we would like to propose an extension of our visualization with the help of specific javascript libraries and some additional experiments with the new system.

3 Generalized One-Sided Concept Lattices

Before discussing details on GOSCL [9], we would like to introduce the standard FCA approach. If we have some set of attributes A and set of objects B, the fact that some object from B has some attribute from A is written in binary incidence relation I. Then (BAI) is called formal context and it is object-attribute model of input data, where \(B,A\ne \emptyset \) and \(I\subseteq B\times A\). Then, we are able to define a pair of mappings and , for every \(X\subseteq B\) and \(Y\subseteq A\), as followsFootnote 3:

This pair of mappings forms Galois connection, which leads to concept lattice. Consequently there is given a set

\(\mathcal {B}(B,A,I)\) forms a complete lattice called concept lattice and pairs of subsets (XY) (fixed points of Galois connection) are called formal concepts.

This crisp model can be easily extended to one-sided fuzzy concept lattice, if one-side of the model will be defined as fuzzy set. In [8] author proposed model where the input consists of an L-context (BAR), with L as a complete lattice and relation \(R:B\times A\rightarrow L\) as a binary L-fuzzy relation. It means that attributes now obtains values from L. Thanks to one-sided property, it is simple to define a pair of mappings \(\beta :2^B\rightarrow L^A\) and \(\alpha :L^A\rightarrow 2^B\), which forms Galois connection (and produces so-called one-sided fuzzy concept lattice) and is defined as follows:

$$\beta (X)(a)=\bigwedge _{b\in X} R(b,a),$$
$$\alpha (g)= \{ b\in B: \text { for each } a \in A,\ g(a)\le R(b,a)\}.$$

A problem of the previous model is that one truth value structure L is applied for every attribute. Naturally, data tables are often heterogeneous with different types of attributes. Therefore, another generalization (Generalized One-Sided Concept Lattice - GOSCL) was proposed in [9], where every attribute is able to obtain values from its own truth value structure. Now we provide some basic details of GOSCL.

A 4-tuple \(\big (B, A, \mathsf {L}, R \big )\) is generalized one-sided formal context if the following conditions are fulfilled:

  1. (1)

    B is a non-empty set of objects and A is a non-empty set of attributes.

  2. (2)

    \(\mathsf {L}:A\rightarrow \mathsf {CL}\) is a mapping from the set of attributes A to the class of all complete lattices. Hence, for an attribute \(a\in A\), \(\mathsf {L}(a)\) denotes a complete lattice, which represents a structure of truth values associated to the attribute a.

  3. (3)

    R is generalized incidence relation (input data table), i.e., \(R(b,a)\in \mathsf {L}(a)\) for all \(b\in B\) and \(a\in A\). Thus, R(ba) represents a degree from the structure \(\mathsf {L}(a)\) in which the object b has the attribute a.

Similarly to previous approaches, Galois connection will form concept lattice thanks to defined mappings, i.e., we have mappings between the classical subsets of the set of all objects \(\mathbf {P}(B)\) and the direct products of complete lattices \(\prod _{a\in A}\mathsf {L}(a)\) presented in a generalized one-sided formal context.

Given a generalized one-sided formal context \(\big (B, A, \mathsf {L}, R \big )\), the concept forming operators (which again form Galois connection) \(^{\bot }:\mathbf {P}(B)\rightarrow \prod _{a\in A}\mathsf {L}(a)\) and \(^{\top }:\prod _{a\in A}\mathsf {L}(a)\rightarrow \mathbf {P}(B)\) are defined by

$$ X^{\bot }(a)=\bigwedge _{b\in X} R(b,a),$$
$$ g^{\top }=\{ b\in B: \forall a \in A,\ g(a)\le R(b,a)\}.$$

Formal concepts are defined as the fixed points of the concept forming operators. Formally, \(\mathcal {C}\big (B, A, \mathsf {L}, R \big )\) is defined as the set of all pairs (Xg), where \(X \subseteq B\), \(g \in \prod _{a\in A}\mathsf {L}(a)\), satisfying \( X^{\bot } = g\) and \(g^{\top } = X\). Then X is called the extent and g intent of a concept (Xg). Moreover, we can define the partial order on \(\mathcal {C}\big (B, A, \mathsf {L}, R \big )\) as:

$$ (X_1, g_1) \le (X_2, g_2)\;\; \text { iff }\;\; X_1 \subseteq X_2 \text{ iff } g_1 \ge g_2.$$

Then, if \(\big (B, A, \mathsf {L}, R \big )\) is a generalized one-sided formal context, the set of all concepts \(\mathcal {C}\big (B, A, \mathsf {L}, R \big )\) with the partial order defined above forms a complete lattice, which is also called generalized one-sided concept lattice corresponding to the generalized one-sided formal context (object-attribute model) \(\big (B, A, \mathsf {L}, R \big )\).

We have also provided incremental algorithm for the creation of GOSCL model in several papers related to GOSCL (including [9]). We will skip it, because it is not necessary for understanding of GOSCL model. Now we add some small illustrative example, how input data table and its corresponding GOSCL model looks like.

Fig. 1.
figure 1

Output GOSCL model.

Table 1. Input for GOSCL model.

Let \(B = \{b_1,b_2,b_3,b_4,b_5\}\) are objects and \(A = \{a_1,a_2,a_3\}\) are attributes. Here, \(a_1\) is binary attribute represented by two element chain with values 0 and 1, attribute \(a_2\) is numeric attribute with values from real interval \({<}0,1{>}\) and attribute \(a_3\) is ordinal attribute represented by chain lattice with values from 0, 1, 2, 3. Example of data table is presented in Table 1 and corresponding generalized one-sided concept lattice constructed based on this data table is presented in Fig. 1.

One of the problems of FCA-based methods is that number of generated concepts can be large, therefore it is sometimes necessary to use some reduction technique in order to achieve a more comprehensive result or visualization. For the purpose of this work most important methods are ranking methods which reduce concepts using some metric. It means that some concepts are removed according to fact that their value in some criterion is lower than selected threshold. Especially useful for reduction of too specific concepts is the application of support metric. It means that all concepts with a number of objects lower than selected percentage of the number of all objects are removed.

4 Our Tool for FCA-Based Exploration of Search Results

The main idea for our approach is to provide a tool, where concept lattices of web search results are provided to the user. Hence, the user is able to explore the clusters of results in a structured way, where groups are composed of results which share some part of attributes similarly to any concept lattice. Additionally, the user is able to navigate through these clusters, select them and browse particular links to original web pages. While there were several approaches that tried FCA-based methods in information retrieval, our approach is unique due to the usage of the whole structure of concept lattice or its reductions, possibility to provide both general and focused view of output structure, as well as the application of generalized one-sided approach for processing of different types of input attributes.

The basic idea for implementation of the approach was to integrate FCA approach into aforementioned CET Search tool, where the user can interact with the model of query result set and explore the searched domain. CET provides the possibility to an input query, call the search API (e.g., Bing API like in the first prototype, or Google API, etc.), analyze the results, generate the graph of nodes and edges, visualize them and then explore them in the user interface. After the testing of the first prototype, which was described in [18], we decided also analyze other possibilities for extension of the approach. Currently, many javascript libraries provide various interactive visualizations, which could be easily integrated, adapted and provided to users in usual web browser interface (while CET Search tool frontend is a desktop application). Therefore, for new experiments with the FCA-based exploration, we decided to use some of them. For more focused visualization we selected Double tree graph implemented in D3 javascript library, for a general graph with DAG (Directed Acyclic Graph) layout we tried GO.JS project (which is commercial, but it is possible to use it for research or education with some restrictions).

In more details, if we would like to integrate FCA-based methods (especially fuzzy extension) into graph-based visualization, we need to create object-attribute model from query results set (use weighting of terms in results to reach fuzzy extension), then create output model in the form of concept lattice with nodes (representing concepts = clusters of web engine results), connect them using edges (relation of generalization/specification between concepts) to create hierarchy of concepts, and provide interactivity to all visualization of output model. When some reduction is applied, it should be possible to select some threshold or setup.

Now we will describe some details about the current implementation. The process how the system works follow these steps:

  1. 1.

    User input query to system - user is able to select type of visualization graph (global or focused), number of results to process (10 is default), support value for reduction (0 for no reduction is default), check for use of vector (fuzzy) or binary (crisp) model,

  2. 2.

    System will use API (Bing, Google) to get search results (query results set),

  3. 3.

    System will preprocess model (metadata and snippets are extracted, snippets are used to create object-attribute model), create concept lattice (or its reduction) according to settings, prepare JSON file for selected visualization, push JSON to visualize part of application (provide output HTML code),

  4. 4.

    System will provide use fuzzy extension of FCA in order to analyze weighting of terms in results (vector model from extracted words) and/or metadata descriptions of different types,

  5. 5.

    provide the possibility to apply some reduction (like support-based reduction).

Fig. 2.
figure 2

Example of global visualization in DAG layout - , 10 results, support 10%. Left part - reduced concept lattice, right part - details on selected concept.

In order to illustrate these visualizations, we show two small examples for binary models. First, global visualization of concept lattice (see example in Fig. 2, 10 web search results for specified query, reduced output) has two main parts. On the left side, concept lattice (or its reduction) is shown, where concepts are ordered in DAG layout supported by GO.JS implementation. It means that most general concepts (with all objects) are at the top, then other more specific concepts have an edge to parent concept. Also, the difference to parent concept is visualized on the edge, additional attributes (in this case terms, words in snippets of particular objects in concept) that specify more this concept are shown. The user is able to select one of the concepts and then on the right side, the details of selected concept (highlighted) are shown (at the start, details for top concept are shown). As we can see, the description shows a number of objects, shared attributes of this concept (in the fuzzy model also with their values), as well as information on new attributes according to the parent(s). Then particular results are shown for the current concept, including the link to the original source for every query result.

Another possibility for the user is to use interactive Double tree graph, which provides focused visualization (see Fig. 3) on one concept (in the middle of the left part) and its parents (upper to current node) and children nodes (lower to current node). Again, it is possible to see which words make difference between levels, also, nodes (concepts) are described by their most specific words. On the right side is detailed information on the current concept (same as in the previous type of visualization). The user is able to explore the concept lattice by click on some parent or child. Then clicked node will be focused in the middle (with all necessary changes to parents, nodes and detailed information). At the top of the double tree part, the user is also able to use droplist and quickly select any specific concept from the concept lattice defined by its specific words.

Fig. 3.
figure 3

Example of focused visualization in Double tree graph - , 10 results. Left part - focused view on selected concept and its neighbours, right part - details on selected concept.

When setup for visualization is different, it will apply necessary changes. Thanks to GOSCL it is possible to use vector (fuzzy) model of snippets (from text of snippet, or original document, as well as add metadata attributes) and create concept lattice. Then visualization is updated and together with specified name of attribute also change in value should be shown. In current version we only tested fuzzy model containing term frequency of snippet words, but the application is able to work with generally any type of attributes (and their combinations) which are supported by GOSCL model.

5 Experiments

In order to test our tool we did some preliminary experiments with 10 different queries. We would like to see how many concepts are produced, and how effective is the reduction. From this, we were able to estimate more suitable input parameters for visualization tool, i.e., which size of query results set and reduction give us still reasonable output of concept lattice. Then, it was interesting to see if the tool is able to be enough fast in the computation of models. For every experiment, where it was needed, we did more (usually 10) runs of the process and results are based on average numbers.

Fig. 4.
figure 4

The results of experiments with our tool (for more details see Sect. 5).

In the first experiment, we studied the relation between a number of concepts and number of searched results. We did two separate measurements based on the type of input context (binary, or fuzzy - with a minimum of term frequency 1) and no reduction of output. The result is available in Fig. 4, top-left graph, where we can see that difference in our experiments is relatively small. The main reason is the usage of snippets, were term frequencies are not very high and also input table is very sparse (as it is usual in any term-based model of documents). With a longer description of documents and additional metadata attributes, a number of fuzzy concepts will rise faster.

Another experiment tried to show a change in a number of concepts if we setup threshold for a minimum number of term existence in different documents within the query result set, i.e., terms that are in at least n documents are used for the creation of input context. We have used 100 results for the experiments, no reduction was used after creation of concept lattice. In the top-right graph of Fig. 4 we can see that number of concepts decrease fast until value 6, then decrease is linear and slower. Such reduction is good if we want to have more words which are not so specific (like only for one document in query result set), but still are able to discriminate clusters of web results.

The next experiment was related to seeing how support reduction reduce the output (see Fig. 4, bottom-left graph). As you can see, the value of support greatly reduces the number of concepts. Already at the lowest tested value of support (0.1), the number of concepts is reduced to 5%–6% in average. At 0.3, both for binary and fuzzy contexts, practically only 1% of concepts remains after the reduction phase.

The last graph available in Fig. 4 (bottom-right part) shows the computation aspects of the application within Google API. Here we found that for different queries, the most consuming part of the process is actually the step for the search of results through API, and the creation of graph. The processing times of particular operations for 100 objects in results set (on different queries) is shown in provided graph. For smaller query results sets (under 20–50 objects, depends on the usage of support reduction) with a reasonable amount of visualized concepts, the processing time of application is usually under half of one second. Therefore, there is no real limitation for usage of the tool if we want to get some useful result.

In conclusion to our preliminary experiments, we also summarized feedback from several users. The readability and usability of resulted output structure are limited to the setup of query result set size to less than 15–20 objects when the reduction is not applied. Otherwise, the structure provided by the system is too large for the user in order to get a useful understanding of clusters and their connections. If we apply reduction techniques (like support function), the readability of output structure depends on its value. By our experiments, we found that even for a maximum of 100 results, we reached a readable and good structure for navigation for support threshold with a value of 0.2. Then understanding and discrimination (or disambiguation) of subgroups of results was relatively good. In the future, other reduction techniques and extensions can help to make even better structuring and visualization of query results set. In the future, we would like to run more experiments with various inputs and setups, as well as provide more reduction techniques and user test studies based on the usage of the system by more users.

6 Conclusions

The main goal of this paper was to provide information on a tool for exploration of search results (to selected query) from web engine using Formal Concept Analysis. While some approach in this area were proposed, we decided for interactive visualization of the concept lattice (and its reductions) based on the one-sided fuzzy FCA approach, which is able to involve different types of attributes about the results of the analysis. Our tool is able to support exploration of query results set using such approaches and help in understanding the structure of the domain. In next steps, we want to extend the functionality of the tool, as well as prepare user study and larger experiments with the system.