1 Introduction

The usage of Formal Concept Analysis (FCA) over large data collections, i.e., large incidence matrices, brings a range of specific problems. The first of them is connected with concept computation: the time complexity grows very quickly with the size of the input sets of objects and attributes. A related problem is the storage of the computed concepts. Other problems come up in the case of lattice visualization: humans are usually not able to read complex graph structures that contain tens or hundreds of vertices. There are two possible ways to solve this problem. First, the visualization methods can be improved, e.g., by using colored graphs, zooming, traversing the graph structure, etc. The second way is to reduce the input data at the cost of some acceptable noise. Such an approach is well known in information retrieval; e.g., Boolean Factor Analysis by Attractor Neural Network (Frolov et al. 2007), Formal Concept Analysis Constrained by Attribute-Dependency Formulas (Bělohlávek and Sklenář 2005), or Ordinal Factor Analysis (Ganter and Glodeanu 2012). Nevertheless, both ways require a fast approach in the computation phase to ensure acceptable computation times when analyzing complex data. We developed a new algorithm that enables FCA to analyze data with a certain rate of indeterminacy, as well as dynamic or fuzzy data.

After a brief introduction to FCA, the existing FCA algorithms are surveyed and the selected algorithm, Valtchev's, is described in more detail. Then the proposed improvements are presented. First, the data storage based on finite automata and fast querying into the incidence matrices are described; the effective data storage decreases memory consumption as well. Next, we focus on the inner process of the computation of all formal concepts. The newly proposed algorithm efficiently utilizes a tree structure and a tree-traversing mechanism. Several experiments illustrate the achieved results.

2 Formal concept analysis

Formal concept analysis (FCA) is a data analysis technique that describes the world in terms of objects and the attributes possessed by those objects. The philosophical starting point for FCA was the understanding that a concept can be described by its extension and intension. The mathematical foundations were laid by Birkhoff (1967), who demonstrated the correspondence between partial orders and lattices. Birkhoff showed that a lattice can be constructed for every binary relation between a set of objects and a set of attributes, with the resulting lattice providing insight into the structure of the original relation.

Formal concept analysis arose during the early 1980s from Rudolf Wille's pioneering work (Wille 2009). It identifies conceptual structures among data sets based on the primary philosophical understanding of a “concept” as a unit of thought comprising its extension and intension, as a way of modelling a domain (Wille 2009; Ganter and Wille 1999). The extension of a concept is formed by all objects to which the concept applies, and the intension consists of all attributes possessed by those objects. These generate a conceptual hierarchy of a domain by finding all possible formal concepts which reflect a certain relationship between attributes and objects.

The FCA method has been successfully applied to a wide range of applications in medicine (Cole and Eklund 1996), psychology (Spangenberg et al. 1999), ecology (Brüggemann et al. 1997), software engineering (Snelting 2000) and information science (Eklund et al. 2000). A variety of methods for data analysis and knowledge discovery in databases have also been proposed based on FCA techniques (Stumme and Wille 1998; Wille 2001). Information retrieval is also a typical application area of FCA (Godin et al. 1993; Priss 2000; Kaytoue et al. 2011; Galitsky and de la Rosa 2011; Li et al. 2012). Some researchers focused on the algebraic approach and tried to improve the theoretical background behind FCA (e.g., Wang et al. 2010; Medina and Ojeda-Aciego 2010).

2.1 The most important definitions from formal concept analysis

As mentioned above, FCA is a way of describing the world in terms of objects and the attributes possessed by those objects. This section introduces the FCA notation and conventions used throughout this article. Unless labelled with another reference, all definitions in this section follow Ganter and Wille (1999).

Definition 1

A formal context \({\mathbb {K}}:=(G, M, I)\) consists of two sets G and M and a relation I between G and M. Elements of G are called objects and elements of M are called attributes of the context. In order to express that an object g is in the relation I with an attribute m, we write \(gIm\) or \((g, m)\in I\) and read it as “the object g has the attribute m”.

The relation \(I\) is also called the incidence relation of the context. All relations between objects and attributes can be written in a table, also called an incidence matrix, as illustrated in Fig. 1.

Fig. 1

An example of the incidence matrix showing existing relationships between objects {1,2,3,4,5,6} and their attributes {a,b,c,d,e,f}

A set of common attributes and a set of common objects have to be defined before the definition of a formal concept.

Definition 2

For a set \(A\subseteq G\) of objects we define

$$\begin{aligned} A^{\prime }=\{m\in M\ |\ gIm\ \text {for all}\ g\in A\} \end{aligned}$$
(1)

the set of attributes common to the objects in \(A\). Similarly, for a set \(B\subseteq M\) of attributes we define

$$\begin{aligned} B^{\prime }=\{g\in G\ |\ gIm\ \text {for all}\ m\in B\} \end{aligned}$$
(2)

the set of objects which have all attributes in \(B\).
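Both derivation operators can be implemented directly from Definition 2. The following Python sketch is illustrative only; the function names and the representation of \(I\) as a set of (object, attribute) pairs are our choices, not part of any FCA library:

    def common_attributes(A, M, I):
        """A' -- the set of attributes shared by all objects in A (Eq. 1)."""
        return {m for m in M if all((g, m) in I for g in A)}

    def common_objects(B, G, I):
        """B' -- the set of objects that have all attributes in B (Eq. 2)."""
        return {g for g in G if all((g, m) in I for m in B)}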

Definition 3

A Formal concept of the context \({\mathbb {K}}:=(G, M, I)\) (or \({\mathcal {L}}:=(G, M, I)\)) is a pair (A, B) with \(A\subseteq G\), \(B\subseteq M\), \(A^{\prime }=B\) and \(B^{\prime }=A\). We call A the extent and B the intent of the concept (A, B). The set of all concepts of a context \(\mathbb {K}\) is denoted \({\mathfrak {B}}(G,M,I)\).

A particular concept is represented in the incidence matrix as a (possibly disconnected) rectangle full of crosses, i.e., a maximal rectangle up to a permutation of rows and columns. Figure 2 shows some concepts.

Fig. 2

This figure shows two concepts of a context

The whole theory is very extensive; we therefore refer to the work of Ganter and Wille (1999) for a detailed description of the FCA terms, and also for the basic definitions concerning ordered sets. Definition 4 connects the theory of ordered sets with the algebraic structure.

Definition 4

An ordered set \(\mathbf{V}:=(V,\le )\) is a lattice if, for any two-element set \(\{x,y\}\) with \(x\in V\), \(y\in V\), the supremum \(x\vee y\) and the infimum \(x\wedge y\) always exist. \(\mathbf{V}\) is called a complete lattice if the supremum \(\bigvee X\) and the infimum \(\bigwedge X\) exist for any subset \(X\) of V.

Theorem 1

The concept lattice \(\underline{\mathfrak {B}}(G,M,I)\) is a complete lattice in which infimum and supremum are given by:

$$\begin{aligned} \bigwedge _{t\in T}(A_t,B_t)=\left( \,\bigcap _{t\in T}A_t,\left( \,\bigcup _{t\in T}B_t\right) ^{\prime \prime }\right) \end{aligned}$$
$$\begin{aligned} \bigvee _{t\in T}(A_t,B_t)=\left( \left( \,\bigcup _{t\in T}A_t\right) ^{\prime \prime },\bigcap _{t\in T}B_t\right) , \end{aligned}$$

where \(T\) is an index set and, for every \(t\in T\), \(A_t\subseteq G\) and \(B_t\subseteq M\).
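Reusing the derivation operators sketched after Definition 2, both formulas of Theorem 1 translate directly into code. This is again only an illustrative sketch; it assumes a non-empty family of concepts, each given as an (extent, intent) pair:

    def meet(concepts, G, M, I):
        """Infimum of a family of concepts (A_t, B_t): intersect the extents,
        close the union of the intents (Theorem 1, first formula)."""
        extent = set.intersection(*(set(A) for A, _ in concepts))
        union_int = set.union(*(set(B) for _, B in concepts))
        intent = common_attributes(common_objects(union_int, G, I), M, I)  # (union B_t)''
        return extent, intent

    def join(concepts, G, M, I):
        """Supremum: close the union of the extents, intersect the intents."""
        union_ext = set.union(*(set(A) for A, _ in concepts))
        extent = common_objects(common_attributes(union_ext, M, I), G, I)  # (union A_t)''
        intent = set.intersection(*(set(B) for _, B in concepts))
        return extent, intent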

The concept lattice represents the relations between formal concepts and it is used as a basis for further visualization. Usually, all practical usages of formal concept analysis need some kind of visual representation, because it is more readable than a simple list or table. The concept lattice can be represented by a Hasse diagram, as in Example 1(b).

Example 1

Context, concept and lattice. This example refers to Fig. 1, where the incidence matrix defines six objects and their attributes. A computed list of all possible concepts is shown in Fig. 3(a), together with the related concept lattice visualized by a Hasse diagram (b).

Fig. 3

A computed list of all concepts (a) and the related concept lattice (b) for the illustrative incidence matrix of Fig. 1. The concept lattice \(\underline{\mathfrak {B}}(G,M,I)\) is given by the set of objects \(G=\{1,2,3,4,5,6\}\), the set of attributes \(M=\{a,b,c,d,e,f\}\), and the relation \(I\subseteq G\times M\) illustrated in Fig. 1
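For contexts as small as the one in Example 1, the whole set \({\mathfrak {B}}(G,M,I)\) can be enumerated naively by closing every attribute subset, using the derivation operators sketched earlier. Since the actual crosses of Fig. 1 live in the figure, the toy context below is a hypothetical stand-in, not the paper's data:

    from itertools import combinations

    def all_concepts(G, M, I):
        """Enumerate all formal concepts by closing every attribute subset.
        Exponential in |M| -- suitable only for toy contexts."""
        found = set()
        for r in range(len(M) + 1):
            for B in combinations(sorted(M), r):
                extent = common_objects(set(B), G, I)       # B'
                intent = common_attributes(extent, M, I)    # B'' = extent'
                found.add((frozenset(extent), frozenset(intent)))
        return found

    # Hypothetical toy context (NOT the actual data of Fig. 1):
    G = {1, 2, 3}
    M = {"a", "b", "c"}
    I = {(1, "a"), (1, "b"), (2, "b"), (2, "c"), (3, "b")}
    for A, B in sorted(all_concepts(G, M, I), key=lambda c: len(c[0])):
        print(sorted(A), sorted(B))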

3 FCA algorithms—the state of the art

The following sections give an overview of the known approaches to concept computation. The algorithm classes are described in more detail. The improvement of the selected algorithm and the related experiments are presented in a separate section.

Computing a concept lattice is an important issue and has been widely studied in order to develop more efficient algorithms. As a consequence, a number of batch algorithms (Chein 1969; Ganter 1984; Bordat 1986; Ganter and Reuter 1991; Kuznetsov 1993; Lindig 1999) and incremental algorithms (Norris 1978; Dowling 1993; Godin et al. 1995; Carpineto and Romano 1996; Ganter and Kuznetsov 1998; Nourine and Raynaud 1999; Stumme 2000; Valtchev and Missaoui 2001; Krajca et al. 2010; Andrews 2011) have been proposed. Batch algorithms build formal concepts and a concept lattice from the whole context in a bottom-up approach (from the maximal extent or intent to the minimal one) or a top-down approach (from the minimal extent or intent to the maximal one). Incremental algorithms gradually reformulate the concept lattice, starting from a single object with its attribute set.

Table 1 shows a summary of the time complexities of the algorithms. \(\left| G\right| \) denotes the number of objects, \(\left| M\right| \) the number of attributes and \(\left| L\right| \) the size of the concept lattice. In Godin's (1995) algorithm, \(\mu \) designates an upper bound on \(\left| f({x})\right| \), where \(f({x})\) denotes the set of objects associated with the attribute \(x\). When there is a fixed upper bound \(\mu \), the time complexity of this algorithm is \(O(\left| G\right| )\). The Nourine–Raynaud algorithm (Nourine and Raynaud 1999) is half-incremental: it incrementally constructs the concept set, but formulates the lattice graph in a batch. Merwe's algorithm uses a max function, which returns the number of elements in the largest intent among the concepts whose extents consist of \(\{g\}\).

Table 1 The list of most important algorithms

Most of the mentioned algorithms were used in different application areas. We refer to Yevtushenko (2004), Carpineto and Romano (1996), Lindig (1995), Siff and Reps (1997), Snelting (2000), Vogt and Wille (1995), Goethals (2002) for more information on this.

The main issue in computing all the formal concepts of a context is how to generate all of them without repeatedly generating the same concept. Kuznetsov and Ob’edkov (2001) noted that an empirical comparison of algorithms is not an easy task for a number of reasons. First of all, the algorithms as described by their authors are often unclear, leading to misinterpretations. Secondly, the data structures of the algorithms and their realisations are often not specified. Another issue is related to setting up consensus data sets to use as test beds. Context parameters, such as the average size of the attribute set associated with an object and vice versa, or the size and density of the contexts, should be considered. The test environment, such as the programming language, implementation techniques and platform, is also a crucial factor which influences the performance of algorithms.

3.1 Selected algorithm

The Valtchev algorithm extended the Godin et al. (1995) algorithm through two scenarios. In the first scenario, the algorithm updates the initial lattice by considering new objects one at a time. In the second one, it first builds the partial lattice over the new object set and then merges it with the initial lattice. The first variant showed an improvement over the Godin et al. algorithm and was recommended for small sets of objects. The second, on the other hand, was recommended as the right choice for medium-sized sets. The selected incremental algorithm thus seems to be very effective in computations over small or medium data sets.

In our work, we try to extend all the good features of these algorithms in order to use them for the computation of larger incidence matrices. A careful study of the algorithm description in Valtchev and Missaoui (2001) brought the idea of using a tree structure in the algorithm to avoid continuously traversing the concept lattice. This saves a lot of time in the case of large data collections and large, dense incidence matrices. The second improvement relates to effective data storage. We applied the research of the Amphora Research Group (ARG 2012) in the area of finite automata. A sparse binary matrix representation based on a finite automaton was developed, and we used it to store the incidence matrix and the computed concepts. It allows us to read the columns or rows of incidence matrices more quickly in comparison with other types of matrix storage.

3.3 What does “incremental” mean in the context of FCA algorithms?

There are several dimensions along which existing algorithms can be characterised, e.g., computational complexity, the mechanism for checking whether a concept is new, the way of performing the calculation (i.e., batch or incremental data mining), etc. An incremental algorithm produces a concept lattice for the first \(i\) objects on the \(i\)-th execution step. The new set of concepts is obtained as a result of an update of the previous one. When a new object is added to a context, the following steps should be performed in order to obtain the concept lattice of the updated context:

  1. Generation of new concepts in the case when the new object is meet-irreducible.

  2. Update of concepts already existing in the previous lattice. Only those concepts whose intents are included in the new object's attribute set are updated; the new object is added to their extents.

  3. Update of links between neighbouring concepts in the case when the Hasse diagram of the lattice is required.

[Listing 1: procedure IncrementalConceptSetCalculation]

The typical scheme of an incremental algorithm is presented in listing 1. The procedure IncrementalConceptSetCalculation demonstrates the general scheme of the work of incremental algorithms, i.e., the generation of the concept lattice on an object-by-object basis.
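Since listing 1 survives here only as a placeholder, the following Python sketch conveys its general structure in executable form. It is our own minimal reconstruction of the scheme, not the paper's exact pseudocode: each concept of the current lattice is either “modified” (its intent is contained in the new object's intent, so the object joins its extent) or acts as a generator of a candidate new concept.

    def incremental_concept_set(objects, M):
        """General scheme of incremental algorithms: after processing the first
        i objects, `concepts` holds the concepts of the context so far.
        `objects` is a list of (object, intent) pairs; intents are frozensets."""
        concepts = {frozenset(M): set()}      # intent -> extent; starts with (set(), M)
        for g, g_int in objects:
            new = {}
            for B, A in concepts.items():
                if B <= g_int:
                    A.add(g)                  # modified concept: g joins its extent
                else:
                    C = B & g_int             # candidate intent of a new concept
                    if C not in concepts:     # intents already present are handled above
                        new.setdefault(C, set()).update(A | {g})
            concepts.update(new)
        return [(A, set(B)) for B, A in concepts.items()]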

The generalization of this algorithm, known as Valtchev's algorithm (proposed in Valtchev and Missaoui 2001), became a general pattern for the construction of concept lattices. It is generally supposed that the computational complexity of this algorithm is equal to the computational complexity of the Nourine–Raynaud algorithm. However, the description of this algorithm does not include any tree structure to improve searching in a concept list. In our work, a tree structure that stores the concept list in an automaton matrix was used within Valtchev's algorithm. Experimental results show an interesting improvement.

3.3 Valtchev’s algorithm

A problem common to all FCA algorithms is the size of the incidence matrix and the density of the input data. Usually, the whole concept lattice is not computed because it is too time-consuming. We decided to use a new alternative approach to data storage and a tree structure to improve Valtchev's algorithm. This enables us to use the algorithm on much larger incidence matrices (thousands of objects and attributes). Finally, we are able to use this modified algorithm, for example, in document retrieval.

The following points summarize the reasons why we decided to pay attention to Valtchev's algorithm and its possible improvement:

  • It is an incremental algorithm and we want to work with dynamic data sets (e.g. the number of documents is increasing).

  • It is very difficult to obtain a good specification of most algorithms, because the sources are usually not well defined. Valtchev's algorithm is well described.

  • The time complexity of this algorithm is better than the complexity of the other incremental algorithms.

  • No tree structure is implemented. One needs to pass through the whole existing concept lattice to insert a new concept and update the lattice structure.

  • The algorithm is based on the cyclic reuse of the data stored in previous execution steps. Therefore, an effective data structure should be applied.

3.4 The description of Valtchev’s algorithm

First, the main idea of Valtchev's approach will be introduced. Listing 2 represents the pivotal procedure called ADD_OBJECT. It can be divided into two parts. The first part (lines 1–11) is in charge of any specific treatment required by the presence of attributes in the description of \(o_i\) which are not in \(A_i = \bigcup _{j<i}\{o_j\}^{\prime }\). The second part consists of a top-down traversal of the lattice \({\mathcal {L}}_{i-1}\) with a recognition of each current concept's category (lines 12–28). As for the updates, modified concepts merely receive \(o_i\) in their extent, whereas the detection of a generator \(c\) leads to the creation of the new concept \(\overline{c}\), whose intent is the intersection of the generator's intent with \(\{o_i\}^{\prime }\) and whose extent is the generator's extent augmented by \(o_i\) (lines 19–25). The new concept is then inserted into the (partially) completed lattice \({\mathcal {L}}_{i}\), a task which requires the detection of all its upper covers. The algorithm FIND_UPPER_COVERS (see listing 3) therefore looks through a superset of the actual covers and picks up the qualifying nodes: the candidate set is made up of all concepts whose intents are strictly smaller in size than \(Int(\overline{c})\). A candidate qualifies only if it is a super-concept of \(\overline{c}\), but none of its lower covers is a super-concept of \(\overline{c}\) on its own. An upper cover is physically linked to \(\overline{c}\), and the obsolete links, i.e., those relating the generator \(c\) to an upper cover, are removed. The integration of \(\overline{c}\) is completed by a link to its generator (line 26) (Valtchev and Missaoui 2001).

[Listing 2: procedure ADD_OBJECT]
[Listing 3: procedure FIND_UPPER_COVERS]
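The cover search can be made concrete with a small sketch. The representation below (intents as frozensets, the lattice as a map from each concept's intent to the intents of its lower covers) is our assumption; only the filtering logic follows the description above:

    def find_upper_covers(new_intent, lower_covers):
        """Sketch of FIND_UPPER_COVERS: `lower_covers` maps each concept's
        intent (a frozenset) to the list of intents of its lower covers."""
        covers = []
        for cand, children in lower_covers.items():
            # candidates: intents strictly smaller in size and proper super-concepts
            if len(cand) < len(new_intent) and cand < new_intent:
                # qualify only if no lower cover is itself a super-concept
                if not any(child <= new_intent for child in children):
                    covers.append(cand)
        return covers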

4 Proposed improvements

4.1 Data storage and fast querying into an incidence matrix

A quick look at Valtchev's algorithm reveals that we should pay attention to the for-cycles (lines 12, 15, 17, 18 in listing 2 and lines 3, 4, 7 in listing 3). In particular, line 15 (see listing 2) seems to be very time-consuming in the case of a large concept lattice \(\mathcal {L}\). Moreover, Valtchev's algorithm works with an already generated concept lattice, and that is why the data should be stored very effectively: a large and dense incidence matrix yields a great number of concepts. The improvement is based on the binary automaton matrix, which has been used for data storage and serves as a basis of the implemented tree structure. The binary automaton matrix is a result of research conducted by the Amphora Research Group (ARG 2012). For a detailed description of the effective data storage of binary matrices see Martinovič et al. (2005). This approach was extended, among other things, by the implementation of meet and join operators.

The concept lattice and the tree structure are created in parallel. The tree structure speeds up traversing the concept lattice and searching for concepts. Both the tree structure and the set of already generated concepts are stored in automaton matrices. The extents and intents of all concepts are stored separately to save space and memory.

Large sparse binary matrices play an important role in computer science (e.g., indexing of class hierarchies, Dencker et al. 1984) and in many modern information retrieval methods. These methods, such as clustering, web graph computations, web link analysis, and binary factor analysis, perform a huge number of computations with matrices. Therefore, the matrix representations should be carefully designed (e.g., sparse matrix multiplication, Yuster and Zwick 2005), and several storage techniques have been developed to fit particular algorithms' requirements, such as the Yale Sparse Matrix Format (Bank and Douglas 2012), Compressed Row Storage (CRS), and Compressed Column Storage (CCS). In our work, however, we focused on a storage system based on automata theory. We refer to Eilenberg (1974), Kozen (1997) for more details on this theory.

We consider a square matrix \(M\) of order \(2^n \times 2^n\) to facilitate the application of finite automata. Each element of such a matrix is represented by an address, a word \(w\) of length \(n\) over the alphabet \(\Sigma = \{0, 1, 2, 3\}\). Each element of the matrix corresponds to a subsquare of size \(2^{-n}\) of the unit square.

The main matrix is split recursively into four quadrants (submatrices); the depth of this recursion is \(n\). Each quadrant is labelled with one letter of the alphabet \(\Sigma \), which represents its current address \(w\). One letter of the alphabet is appended to the quadrants' addresses after each step of the recursion.

Example 2

Addressing in a square matrix. Figure 4 shows the level-2 addresses based on the quadrants of a \(4 \times 4\) matrix. The related table shows the addresses (\(w\)) of all matrix elements.

Fig. 4
figure 4

Addressing in a \(4\times 4\) matrix

In order to specify the values of a matrix of dimension \(2^{n} \times 2^{n}\), a mapping \(\Sigma ^{n} \rightarrow R\) has to be defined; alternatively, the matrix can be specified just by the set of its non-zero values, i.e., by a language \(L \subseteq \Sigma ^{n}\) and a mapping \(f_{M}: L \rightarrow R\).

This kind of storage system allows direct access into the stored matrix: each element can be accessed independently of the others, in time that does not depend on the number of stored elements. Let \(M\) be a matrix of order \(2^n \times 2^n\); then the time complexity of a single access is bounded by the address length \(n\), i.e., it is logarithmic in the matrix dimension. For detailed information see Snášel et al. (2002).

This approach to data storage makes it possible to implement GET_ROW and GET_COLUMN methods that are equally time-consuming. This aspect is very important for the computations within all FCA algorithms.

4.1.1 Binary matrices and finite automata

Let \(M\) be a binary matrix of dimension \(n_{1} \times n_{2}\), where \(n_{1}, n_{2} \in {\mathbb {N}}\). Let \(n = \max (n_{1}, n_{2})\) and \(size = 2^{\left\lceil \log _2 n \right\rceil }\). The finite automaton representing matrix \(M\) is defined as:

$$\begin{aligned} A = (Q; \Sigma ; \delta ; q_0; F), \end{aligned}$$

where:

  • \(Q\) is the finite set of states;

  • \(\Sigma = \left\{ 0, 1, 2, 3 \right\} \) is the input alphabet;

  • \(\delta : Q \times \Sigma \rightarrow Q\) is the transition function;

  • \(q_0\) is the initial state;

  • \(F = \{q_f\}\) is the set containing the single final state.

The language \(L\) accepted by automaton \(A\) is defined as:

$$\begin{aligned} L = \left\{ w \in \Sigma ^*\ |\ \exists x,y \in {\mathbb {N}},\ 1 \le x,y \le n:\ M_{x,y} = 1\ \text { and }\ w = l(x, y, size) \right\} \end{aligned}$$

where \(l: {\mathbb {N}} \times {\mathbb {N}} \times {\mathbb {N}} \rightarrow \Sigma ^*\):

$$\begin{aligned} l(x,y,val) = \left\{ \begin{array}{l@{\quad }l} \varepsilon &{} \text {for } val = 0 \\ 3\,l(x-val,\,y-val,\,val/2) &{} \text {for } x \ge val \wedge y \ge val\\ 2\,l(x-val,\,y,\,val/2) &{} \text {for } x \ge val \wedge y < val\\ 1\,l(x,\,y-val,\,val/2) &{} \text {for } x < val \wedge y \ge val\\ 0\,l(x,\,y,\,val/2) &{} \text {for } x < val \wedge y < val\\ \end{array}\right. \end{aligned}$$
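A direct transcription of \(l\) into Python follows. One practical note: to reproduce the addresses of Example 3 below (e.g., \(M_{1,1}\mapsto 000\) and \(M_{4,6}\mapsto 123\)), we call the function with 0-based coordinates and an initial value of \(size/2\); this calling convention is our reading, not stated explicitly in the text.

    def l(x, y, val):
        """Quadrant address of element (x, y) (0-based) with current half-size
        val. Letters: 0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right."""
        if val == 0:
            return ""                                  # epsilon
        if x >= val and y >= val:
            return "3" + l(x - val, y - val, val // 2)
        if x >= val:
            return "2" + l(x - val, y, val // 2)
        if y >= val:
            return "1" + l(x, y - val, val // 2)
        return "0" + l(x, y, val // 2)

    # Example 3 check: 4x6 matrix, size = 8, initial val = size // 2 = 4
    assert l(0, 0, 4) == "000" and l(3, 5, 4) == "123"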

4.1.2 Reconstruction of the matrix

For each element \(M_{x,y}\) of the matrix, the following holds:

$$\begin{aligned} M_{x,y} = \left\{ \begin{array}{l@{\quad }l} 1 &{} \text {if automaton } A \text { accepts } l(x,y,size)\\ 0 &{} \text {otherwise.}\\ \end{array}\right. \end{aligned}$$

4.1.3 Construction of the automaton

A finite automaton for a given binary matrix can be constructed in several ways, e.g., by the usage of regular expressions (Dvorský 2004) or as a composition of elementary automata (Rozenberg and Salomaa 1997). All these methods of construction share a common disadvantage: the whole matrix, i.e., the complete language, has to be known in advance. A better choice is an incremental construction of the automaton.

The incremental construction method starts with an empty automaton (with the exception of a global initial and final state) and updates the automaton step by step. As soon as a new element is inserted into the matrix, the corresponding word is added to the automaton and the automaton is transformed into a new one. The whole algorithm is described in detail in Martinovič et al. (2005).
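A minimal sketch of the incremental construction is shown below. It grows a trie-shaped acceptor with one state per distinct address prefix, which corresponds to the expanded, non-minimized automaton of Example 3 (the paper's automaton uses a single shared final state; since all address words have equal length, the two views accept the same language). The minimization of Martinovič et al. (2005) is not reproduced here.

    class AutomatonMatrix:
        """Sketch: a sparse binary matrix stored as an acceptor over {0,1,2,3}."""
        def __init__(self):
            self.delta = {0: {}}      # transition function: state -> {symbol: state}
            self.final = set()        # accepting states
            self.count = 1            # next free state id

        def insert(self, word):
            """Add the address word of one non-zero element (incremental update)."""
            q = 0
            for s in word:
                if s not in self.delta[q]:
                    self.delta[q][s] = self.count
                    self.delta[self.count] = {}
                    self.count += 1
                q = self.delta[q][s]
            self.final.add(q)

        def accepts(self, word):
            """M[x][y] = 1 iff the automaton accepts the address of (x, y)."""
            q = 0
            for s in word:
                q = self.delta.get(q, {}).get(s)
                if q is None:
                    return False
            return q in self.final

    # Building automaton A_6 of Example 3 word by word:
    a = AutomatonMatrix()
    for w in ["000", "013", "020", "100", "123", "022"]:
        a.insert(w)
    assert a.accepts("123") and not a.accepts("111")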

Usually, high space complexity is the main disadvantage of such algorithms. The number of states of the automaton increases after a new word \(w \in L\) is inserted. Therefore, the so-called minimization of the automaton is performed after a given number of insertions. Theorems and proofs of one minimization approach are described in Martinovič et al. (2005).

Example 3

Automaton matrix. Let \(M\) be a matrix of size \(4 \times 6\)

$$\begin{aligned} M = \left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c@{\quad }c@{\quad }c} 1 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 \\ 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 \\ \end{array} \right] \end{aligned}$$

The matrix is described by the language \(L_{M} = \{ 000, 013, 020, 100, 123, 022 \}\). The initial automaton \(A_{0}\) does not accept any language, i.e., \(L_{0} = \emptyset \). Automaton \(A_{1}\) is constructed from \(A_{0}\) and accepts the language \(L_{1} = \{ 000 \}\). Automaton \(A_{2}\) is an extension of automaton \(A_{1}\) and accepts the language \(L_{2} = L_{1} \cup \{ 013 \} = \{ 000, 013 \}\). In this way, automaton \(A_{6}\) accepts the whole language \(L_{M}\) (see Fig. 5). This is the expanded, non-minimized form of the automaton. Now automaton \(A_{6}\) can be minimized. The minimization process starts in the final state \(q_{f}\) and proceeds towards the initial state of the automaton. It takes the two following steps:

  1. There are two transitions, from \(q_{2}\) to \(q_{f}\) and from \(q_{6}\) to \(q_{f}\), both labelled with the symbol \(0\). These two transitions can be joined together and state \(q_{6}\) is deleted (see Fig. 6).

  2. Similarly, there are two transitions from \(q_{3}\) and \(q_{7}\) to the final state \(q_{f}\), both labelled with the symbol \(3\). As in the first step, the transitions are unified and state \(q_{7}\) is deleted (see Fig. 7).

Fig. 5

Non-minimized automaton \(A_{6}\)

Fig. 6

Automaton \(A_{6}\), the state \(q_{6}\) will be deleted

Fig. 7

Minimized automaton \(A_{6}\), the state \(q_{7}\) will be deleted

The minimization process plays an important role with respect to the usage of the automaton matrix. The data structure based on the binary matrix automaton is a hidden element of the suggested algorithm; however, it represents an essential part of the improvement of Valtchev's algorithm. The parallel tree structure, the incidence matrix, and the list of computed concepts are all stored in such data structures, which saves a lot of time and storage space in the case of large data sets.

4.2 Algorithm improvements

Listing 4 represents the procedure that creates a new concept list (CREATE_CONCEPT_LIST). A list of objects to be added and a concept lattice (if one exists) are the inputs of this procedure. The procedure can be divided into two parts. The first, called the initialization phase (lines 2–7), contains a decision point that either creates a new concept lattice and a related tree structure, or loads the concept lattice and tree structure of already existing concepts. The tree structure is represented by a square adjacency automaton matrix \(n\times n\), where \(n\) is the number of all existing concepts plus the number of concepts that will be added. The tree structure is dynamic; we refer to Martinovič et al. (2005) for the details of adding a new element into an automaton matrix. It is done incrementally, which is why the tree structure can grow dynamically with the size of the concept lattice. The method CREATE_TREE is responsible for the creation of a new matrix \(|{\mathcal {L}}^{+}|\times |{\mathcal {L}}^{+}|\), whereas the method LOAD_TREE loads the existing tree and enlarges its matrix to \(|{\mathcal {L}}|\times |{\mathcal {L}}|\). The second part, called the updating phase, starts with a loop (line 8) over all new objects. The most important method with respect to lattice updating is INSERT_CONCEPT (see listing 5). The method TRY_ALL_CHILDREN (see listing 6) is responsible for updating the tree structure and will be described in more detail together with the INSERT_CONCEPT method; a consolidated sketch of these procedures is given after listing 6. One more method is used within the CREATE_CONCEPT_LIST procedure: GET_NODE_CHILDREN, whose input is the index of a selected node of the tree structure. This method returns a vector consisting of the node indexes of all children of the selected node in the tree structure.

[Listing 4: procedure CREATE_CONCEPT_LIST]

This procedure (see listing 5) updates not only the list of concepts but also the tree structure. The inner method GET_CONCEPT returns the index of the concept with the specified intent in case it exists in the concept list (line 3). The method returns NULL in case such a concept does not exist; then the ADD_CONCEPT method is called. This method computes the whole concept (the \(^{\prime }\) operator is applied and the list of common objects is returned) and adds it to the concept list. After that, the method ADD_TREE_NODE updates the tree structure and a new link between the “parent” concept and the new concept is created. The index of the new concept is \(|{\mathcal {L}}^{+}|\). If the GET_CONCEPT method returns an index, then a concept \(\overline{c}\) with the given intent already exists and the list of objects is inserted into \(\overline{c}.Extent\) (line 9). The INSERT_CONCEPT procedure returns TRUE if a new concept has been added, otherwise FALSE.

[Listing 5: procedure INSERT_CONCEPT]

This procedure (see listing 6) represents the recursive part of the update phase. If a new object has been inserted into some concept, it is necessary to update all the other concepts linked to this one in the tree structure; in this way, the tree structure is updated. We leave the description of the procedure in its recursive form, because it is more vivid and better elucidates the meaning of the updating process. Of course, it is easy to transform it into a non-recursive form with smaller memory demands.

[Listing 6: procedure TRY_ALL_CHILDREN]
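To make the interplay of the three listings concrete, the following consolidated Python sketch condenses CREATE_CONCEPT_LIST's core bookkeeping, INSERT_CONCEPT and TRY_ALL_CHILDREN as we read them from the descriptions above. The bodies are our reconstruction, not the original pseudocode; plain dictionaries and lists stand in for the automaton-matrix storage of the actual implementation.

    class ConceptLattice:
        """Sketch of the improved algorithm's data structures (dicts replace
        the automaton matrices of the real implementation)."""
        def __init__(self, derive):
            self.derive = derive      # the ' operator: intent -> set of common objects
            self.concepts = []        # index -> [extent (set), intent (frozenset)]
            self.index = {}           # GET_CONCEPT: intent -> concept index
            self.children = {}        # tree structure: index -> child indexes

        def insert_concept(self, parent, intent, objects):
            """INSERT_CONCEPT: returns True iff a new concept was added."""
            i = self.index.get(intent)                   # GET_CONCEPT (line 3)
            if i is None:                                # ADD_CONCEPT
                extent = self.derive(intent) | objects   # ' operator: common objects
                i = len(self.concepts)
                self.concepts.append([extent, intent])
                self.index[intent] = i
                self.children.setdefault(parent, []).append(i)   # ADD_TREE_NODE
                self.children[i] = []
                return True
            self.concepts[i][0] |= objects               # existing concept (line 9)
            return False

        def try_all_children(self, node, g_intent, objects):
            """TRY_ALL_CHILDREN: recursive propagation of the update in the tree."""
            for child in list(self.children.get(node, [])):      # GET_NODE_CHILDREN
                new_intent = self.concepts[child][1] & g_intent
                self.insert_concept(child, new_intent, objects)
                self.try_all_children(child, g_intent, objects)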

5 Comparison and experiments

The new improved FCA algorithm was described in the previous section. The following text focuses on experiments and a comparison with the original Valtchev's algorithm. All tested algorithms and their variants were implemented in the same programming language, by the same person, and with an emphasis on maximum performance and code optimization. Here is a short legend for all the tables in this section:

  • i—the index of computation (test)

  • \(\mathbf {|G|}\)—represents the number of objects in the incidence matrix

  • \(\mathbf {|M|}\)—represents the number of attributes in the incidence matrix

  • d—the density of the incidence matrix, i.e., the percentage of non-zero elements. The density of a context \(\mathbb {K}:=(G, M, I)\) is given by the following formula:

    $$\begin{aligned} d = \frac{|I|}{|G|\cdot |M|} \cdot 100 \end{aligned}$$

    An incidence matrix is usually said to be dense if \(d>25\); however, the property of being dense is rather vague, and there is no rigorous criterion for it.

  • \(\mathbf {|\mathcal {L}|}\)—the size of the concept lattice, i.e., the number of computed concepts.

  • Gajdos—the computation time of the improved algorithm. The incidence matrix is stored using a binary automaton matrix. The tree structure is also based on an automaton matrix, as is the output storage for all computed concepts.

  • Valtchev—the computation time of the original algorithm by Petko Valtchev. The incidence matrix is stored using the CRS method. The way all concepts are stored depends on the size and density of the incidence matrix.

  • ValtchevA—the computation time of the original algorithm again; however, the incidence matrix is stored using the automaton matrix. The way all concepts are stored depends on the size and density of the incidence matrix.

All experiments were performed on the following hardware platform: Intel Core2 CPU, 2.4 GHz, 8GB RAM, Windows 7 64-bit.

5.1 An experiment with a fixed matrix size and increasing density

In this experiment (see Table 2), three algorithms were used. We want to show that the simple usage of the automaton matrix (see ValtchevA in the table) does not yield any improvement. On the contrary, such an algorithm is even more time-consuming, because no tree structure is used and a sparse automaton is created. It is evident that the usage of the Gajdos algorithm is more effective in the case of dense incidence matrices. Table 2 shows the results up to a density of 50 %. Higher densities lead to the dual problem, in which zeros and ones are swapped. Some computation times were too large to be of importance for real applications (see the \(>\)1M values in Table 2).

Table 2 Variable matrix density, \(|G|=100\), \(|M|=100\)

5.2 An experiment with an increasing number of attributes in the incidence matrix

We did not use the ValtchevA algorithm in this experiment (see Table 3), since the previous experiment showed that it is not effective. The particular incidence matrices consist of 1,000 objects, and the density was always set to 5 % of the respective matrix.

Table 3 Variable number of attributes, \(|G|=1,000\), \(d=5\) %

5.3 An experiment with an increasing number of objects in the incidence matrix

At the beginning, the incidence matrix consists of 1,000 objects (\(|G|\)) and 100 attributes (\(|M|\)); the number of attributes and the density \(d=5\,\%\) were fixed. Table 4 shows the total computation times (in seconds) for each size of the incidence matrix. The ValtchevA algorithm was not tested any more due to its obvious ineffectiveness.

Table 4 Variable number of objects, \(\left| M\right| =100\), \(d=5\) %

It is evident that the improved algorithm is again more effective. The incremental time needed to insert 1,000 objects into the existing concept lattice grows quite slowly in comparison with Valtchev's algorithm (compare two consecutive computation times in the appropriate table column).

The incidence matrix itself is not very important for the whole computation, because a new object is compared with the concepts in the concept lattice \(\mathcal {L}\). In this case, the automaton matrix and the related tree structure play the important role within the improved algorithm.

6 Conclusion and future work

Practical approaches to complex problems usually run into difficulties with large data sets and the need to simplify and accelerate the computation phase. In this article, we described several aspects and problems related to Formal Concept Analysis and its usage in the area of information retrieval. The algorithmic aspects were described in more detail and a new algorithm based on Valtchev's algorithm was introduced. The reasons for choosing this algorithm were presented, as well as several improvements based on the usage of the automaton matrix and tree structures. Finally, a set of experiments demonstrated significant improvements. The proposed algorithm can be applied to larger data sets.

Current research in this area and near-future work will focus on the utilization of modern hardware, e.g., graphics processing units (GPUs), to improve performance. Of course, not every part of the proposed algorithm can be parallelized. On the other hand, the designed data structures can easily be implemented in a form suitable for GPUs, e.g., dynamic textures, etc.