It is evident that most big data are represented in unstructured or semi-structured forms, such as images and text. It is therefore important to study how to represent data, whether structured, semi-structured or unstructured, in an abstract form, or how to use label proportions to characterize the nature of data, so that data mining and data analytics algorithms can run smoothly. Learning methods are very useful tools for understanding data. Learning algorithms can be considered from different perspectives, such as cognitive computing, mathematics, and machine learning.

This chapter deals with different learning techniques in the context of data science. Section 6.1 discusses learning through concepts (abstractions of big data) in four subsections. Section 6.1.1 presents a concept-cognitive learning model for incremental concept learning [58]. Section 6.1.2 describes a concurrent concept-cognitive learning model for classification [60]. Section 6.1.3 covers semi-supervised concept learning by concept-cognitive learning and conceptual clustering [42]. Section 6.1.4 introduces a fuzzy-based concept learning method that exploits data with fuzzy conceptual clustering [43]. Section 6.2 presents how to use label proportions for learning, again in four subsections. Section 6.2.1 gives a fast algorithm for learning from label proportions [84]. Section 6.2.2 addresses learning from label proportions with generative adversarial networks [39]. Section 6.2.3 treats learning from label proportions on high-dimensional data [57]. Section 6.2.4 studies learning from label proportions with pinball loss [59]. Section 6.3 explores other enlarged learning models in two subsections. Section 6.3.1 is about classifying with adaptive hyper-spheres, an incremental classifier based on competitive learning [38]. Section 6.3.2 describes the construction of robust representations for small data sets using the broad learning system [66].

1 Concept of the View of Learning

1.1 Concept-Cognitive Learning Model for Incremental Concept Learning

Cognitive computing is viewed as an emerging computing paradigm of intelligent science that implements computational intelligence by trying to solve the problems of imprecision, uncertainty and partial truth in biological systems [44, 68, 72]. As far as we know, it has been investigated by simulating human cognitive processes such as memory [33, 62], learning [14, 30, 36], thinking [69] and problem solving [70].

In this subsection, a novel concept-cognitive learning model (CCLM) is proposed based on a formal decision context. Moreover, to reduce its computational complexity, granular computing is incorporated into the model. The main contributions are as follows:

  1. We describe a new model for incremental learning from the perspective of cognitive learning, fusing concept learning, granular computing, and formal decision context theory. More precisely, it attempts to construct a novel incremental algorithm by imitating human cognitive processes, and a new theory is proposed for concept classification under a formal decision context.

  2. Beyond traditional CCL systems, such as the approximate CCL system [35, 36], the three-way CCL system [37] and the theoretical CCL system [68,69,70], CCLM achieves both incremental concept learning and generalization ability.

  3. Different from other classifiers, and similar to human learning processes, previously acquired knowledge can be stored directly in the concept lattice space of CCLM, which offers good interpretability through concept hierarchies (e.g., the Hasse diagram [16]).

1.1.1 Preliminaries

Now, we briefly review some basic notions related to (1) formal context, (2) formal decision context and (3) concept-cognitive learning.

1.1.1.1 A. Formal Context and Formal Decision Context

Definition 6.1 ([75])

A formal context is a triplet (G, M, I), where G is a set of objects, M is a set of attributes, and I ⊆ G × M is a binary relation between G and M. Here, gIm means that the object g has the attribute m.

Furthermore, the derivation operator (⋅)′ is defined for A ⊆ G and B ⊆ M as follows:

$$\displaystyle \begin{aligned} \begin{gathered} A'=\{m\in M|gIm\ \mbox{for}\ \mbox{all}\ g\in A\},\\ B'=\{g\in G|gIm\ \mbox{for}\ \mbox{all}\ m\in B\}. \end{gathered} \end{aligned} $$
(6.1)

A′ is the maximal set of the attributes that all the objects in A have in common and B′ is the maximal set of the objects shared by all the attributes in B. A concept in the context (G, M, I) is defined to be an ordered pair (A, B) if A′ = B and B′ = A, where the elements A and B of the concept (A, B) are called the extent and intent, respectively. The set of all concepts forms a complete lattice, called the concept lattice and denoted by L(G, M, I).
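The derivation operators in Eq. (6.1) and the concept condition A′ = B, B′ = A are straightforward to compute on a small context. A minimal Python sketch, where the toy objects, attributes and incidence relation are illustrative rather than taken from the text:

```python
# Toy formal context (G, M, I) to illustrate Eq. (6.1).
G = {"g1", "g2", "g3"}
M = {"m1", "m2", "m3"}
I = {("g1", "m1"), ("g1", "m2"), ("g2", "m1"), ("g3", "m2"), ("g3", "m3")}

def prime_objects(A):
    """A' = attributes that all objects in A have in common."""
    return {m for m in M if all((g, m) in I for g in A)}

def prime_attributes(B):
    """B' = objects that have all attributes in B."""
    return {g for g in G if all((g, m) in I for m in B)}

# (A, B) is a formal concept iff A' = B and B' = A.
A = {"g1", "g2"}
B = prime_objects(A)
assert B == {"m1"} and prime_attributes(B) == A  # ({g1, g2}, {m1}) is a concept
```

Enumerating all pairs (A′′, A′) over subsets A ⊆ G would yield the full concept lattice L(G, M, I).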

Definition 6.2 ([74, 82])

A formal decision context is a quintuple (G, M, I, D, J), where (G, M, I) and (G, D, J) are two formal contexts. M and D are respectively called the conditional attribute set and the decision attribute set with M ∩ D = ∅.

Definition 6.3 ([34])

Let (G, M, I, D, J) be a formal decision context and E ⊆ M. For any (A, B) ∈ L(G, E, I E) and (Y, Z) ∈ L(G, D, J), if A ⊆ Y , and A, B, Y  and Z are nonempty, then we say that (Y, Z) can be implied by (A, B), which is denoted by (A, B) → (Y, Z).

By Definitions 6.2 and 6.3, we obtain the relationship between the conditional attribute set and the decision attribute set.
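Once concepts from the conditional and decision contexts are available, the implication test of Definition 6.3 reduces to a nonemptiness and set-inclusion check. A minimal sketch with illustrative sets:

```python
def implies(A, B, Y, Z):
    """(A, B) -> (Y, Z) iff A, B, Y, Z are all nonempty and A is a subset of Y
    (Definition 6.3)."""
    return bool(A) and bool(B) and bool(Y) and bool(Z) and A <= Y

# A conditional concept ({g1, g2}, {m1}) implies a decision concept
# ({g1, g2, g3}, {d1}) because {g1, g2} is contained in {g1, g2, g3}.
assert implies({"g1", "g2"}, {"m1"}, {"g1", "g2", "g3"}, {"d1"})
```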

1.1.1.2 B. Concept-Cognitive Learning

Let G be an object set and M be an attribute set. We denote the power sets of G and M by 2G and 2M, respectively. In addition, \(\mathcal {F}:2^{G}\rightarrow 2^{M}\) and \(\mathcal {H}:2^{M}\rightarrow 2^{G}\) are assumed to be two set-valued mappings, hereinafter written as \(\mathcal {F}\) and \(\mathcal {H}\) for short.

Definition 6.4 ([36])

Set-valued mappings \(\mathcal {F}\) and \(\mathcal {H}\) are called cognitive operators if for any A 1, A 2 ⊆ G and B ⊆ M, the following properties hold:

$$\displaystyle \begin{aligned} &\mathrm{(i)} \qquad \ \ A_{1}\subseteq A_{2}\Rightarrow \mathcal{F}(A_{2})\subseteq \mathcal{F}(A_{1}),\\ &\mathrm{(ii)} \qquad \; \mathcal{F}(A_{1}\cup A_{2})\supseteq \mathcal{F}(A_{1})\cap \mathcal{F}(A_{2}),\\ &\mathrm{(iii)} \qquad\mathcal{H}(B)=\{g\in G|B\subseteq \mathcal{F}(\{g\})\}. \end{aligned} $$

For convenience, hereinafter \(\mathcal {F}(\{g\})\) is rewritten as \(\mathcal {F}(g)\) for short when there is no confusion.

Definition 6.5 ([36])

Let \(\mathcal {F}\) and \(\mathcal {H}\) be cognitive operators. For g ∈ G and m ∈ M, we say that \((\mathcal {H}\mathcal {F}(g),\mathcal {F}(g))\) and \((\mathcal {H}(m),\mathcal {F}\mathcal {H}(m))\) are granular concepts.

Definition 6.6 ([36])

Let G i−1, G i be object sets of {G t} and M i−1, M i be attribute sets of {M t} , where {G t} is a non-decreasing sequence of object sets G 1, G 2, …, G n and {M t} is a non-decreasing sequence of attribute sets M 1, M 2, …, M m. Denote ΔG i−1 = G i − G i−1 and ΔM i−1 = M i − M i−1. Suppose

$$\displaystyle \begin{aligned} 1) &&&\!\!\mathcal{F}_{i-1}\!:2^{G_{i-1}}\!\!\rightarrow 2^{M_{i-1}},&&\!\!\mathcal{H}_{i-1}\!:2^{M_{i-1}}\!\!\rightarrow 2^{G_{i-1}},\\ 2) &&&\!\!\mathcal{F}_{\Delta G_{i-1}}\!:2^{\Delta G_{i-1}}\!\!\rightarrow 2^{M_{i-1}}, &&\!\!\mathcal{H}_{\Delta G_{i-1}}\!:2^{M_{i-1}}\!\!\rightarrow 2^{\Delta G_{i-1}},\\ 3) &&&\!\!\mathcal{F}_{\Delta M_{i-1}}\!:2^{G_{i}}\!\!\rightarrow 2^{\Delta M_{i-1}},&&\!\!\mathcal{H}_{\Delta M_{i-1}}\!:2^{\Delta M_{i-1}}\!\!\rightarrow 2^{G_{i}},\\ 4) &&&\!\!\mathcal{F}_{i}\!:2^{G_{i}}\!\!\rightarrow 2^{M_{i}},&&\!\!\mathcal{H}_{i}\!:2^{M_{i}}\!\!\rightarrow 2^{G_{i}} \end{aligned} $$

are four pairs of cognitive operators satisfying the following properties:

$$\displaystyle \begin{aligned} &\mathcal{F}_{i}(g)\ \ = \begin{cases} \mathcal{F}_{i-1}(g)\cup \mathcal{F}_{\Delta M_{i-1}}(g), &\!\!\!\mbox{if}\ g\in G_{i-1},\\ \mathcal{F}_{\Delta G_{i-1}}(g)\cup \mathcal{F}_{\Delta M_{i-1}}(g), &\!\!\!\mbox{otherwise}, \end{cases} \end{aligned} $$
(6.2)
$$\displaystyle \begin{aligned} &\mathcal{H}_{i}(m)= \begin{cases} \mathcal{H}_{i-1}(m)\cup \mathcal{H}_{\Delta G_{i-1}}(m), &\!\!\!\ \mbox{if}\ m\in M_{i-1},\\ \mathcal{H}_{\Delta M_{i-1}}(m), &\!\!\!\ \mbox{otherwise}, \end{cases} \end{aligned} $$
(6.3)

where \(\mathcal {F}_{\Delta G_{i-1}}(g)\) and \(\mathcal {H}_{\Delta G_{i-1}}(m)\) are set to be empty when ΔG i−1 = ∅, and \(\mathcal {F}_{\Delta M_{i-1}}(g)\) and \(\mathcal {H}_{\Delta M_{i-1}}(m)\) are set to be empty when ΔM i−1 = ∅. Then we say that \(\mathcal {F}_{i}\) and \(\mathcal {H}_{i}\) are extended cognitive operators of \(\mathcal {F}_{i-1}\) and \(\mathcal {H}_{i-1}\) with the newly input information ΔG i−1 and ΔM i−1.

In other words, based on Definitions 6.4 and 6.5, the basic mechanism of concept-cognitive process is shown in Definition 6.6.
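The extension rules (6.2) and (6.3) can be sketched directly, modeling the cognitive operators as dictionaries (\(\mathcal {F}\): object → attribute set, \(\mathcal {H}\): attribute → object set); all names and data below are illustrative:

```python
def extend_F(F_prev, F_dG, F_dM, G_prev):
    """F_i(g): reuse F_{i-1}(g) for old objects, F_{dG}(g) for new ones,
    and union in F_{dM}(g) for newly arrived attributes (Eq. (6.2))."""
    F_i = {}
    for g in set(F_prev) | set(F_dG):
        base = F_prev[g] if g in G_prev else F_dG[g]
        F_i[g] = base | F_dM.get(g, set())
    return F_i

def extend_H(H_prev, H_dG, H_dM, M_prev):
    """H_i(m): extend old attributes by H_{dG}(m); new attributes come
    solely from H_{dM}(m) (Eq. (6.3))."""
    H_i = {}
    for m in set(H_prev) | set(H_dM):
        H_i[m] = (H_prev[m] | H_dG.get(m, set())) if m in M_prev else H_dM[m]
    return H_i

# g1 was known with {m1}; g2 arrives with {m1, m2}; new attribute m3 holds for g1.
F_i = extend_F({"g1": {"m1"}}, {"g2": {"m1", "m2"}}, {"g1": {"m3"}}, G_prev={"g1"})
assert F_i == {"g1": {"m1", "m3"}, "g2": {"m1", "m2"}}
```

Note that previously computed operator values are reused and only the increments ΔG and ΔM are processed, which is the source of the model's incremental efficiency.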

1.1.2 Theoretical Foundation

In this section, to adapt to dynamic learning and classification tasks, we present some new notions and properties for the proposed CCLM.

1.1.2.1 A. Initial Concept Generation

Definition 6.7

A regular formal decision context is a quintuple (G, M, I, D, J), where for any z 1, z 2 ∈ D, \(\mathcal {H}(z_{1})\cap \mathcal {H}(z_{2})=\emptyset \). (G, M, I) and (G, D, J) are called the conditional formal context and the decision formal context, respectively.

Note that this means each real-world object is associated with a single label.

Definition 6.8

Let (G, M, I, D, J) be a regular formal decision context, and \(\mathcal {F}\) and \(\mathcal {H}\) be cognitive operators. For g ∈ G and m ∈ M, we say that \((\mathcal {H}\mathcal {F}(g),\mathcal {F}(g))\) and \((\mathcal {H}(m),\mathcal {F}\mathcal {H}(m))\) are conditional granular concepts. Similarly, for y ∈ G and z ∈ D, \((\mathcal {H}\mathcal {F}(y),\mathcal {F}(y))\) and \((\mathcal {H}(z),\mathcal {F}\mathcal {H}(z))\) are decision granular concepts. For simplicity, we denote

$$\displaystyle \begin{aligned} \begin{gathered} \mathcal{G}^{C}=\{(\mathcal{H}\mathcal{F}(g),\mathcal{F}(g))|g\in G \} \cup \{(\mathcal{H}(m),\!\mathcal{F}\mathcal{H}(m))|m\in M \},\\ \mathcal{G}^{D}=\{(\mathcal{H}\mathcal{F}(y),\mathcal{F}(y))|y\in G\} \cup\{(\mathcal{H}(z),\!\mathcal{F}\mathcal{H}(z))|z\in D\}, \end{gathered} \end{aligned}$$

where \(\mathcal {G}^{C}\) and \(\mathcal {G}^{D}\) are respectively called the condition-concept space and the decision-concept space (or class-concept space) under the cognitive operators \(\mathcal {F}\) and \(\mathcal {H}\).

Property 6.1

Let (G, M, I, D, J) be a regular formal decision context, and \(\mathcal {F}\) and \(\mathcal {H}\) be cognitive operators. Then for any A, Y ⊆ G, B ⊆ M and Z ⊆ D, we have

$$\displaystyle \begin{aligned} \begin{gathered} \mathcal{F}(A)=\bigcap_{g\in A}\mathcal{F}(g),\mathcal{F}(Y)=\bigcap_{y\in Y}\mathcal{F}(y),\\ \mathcal{H}(B)=\bigcap_{m\in B}\mathcal{H}(m),\mathcal{H}(Z)=\bigcap_{z\in Z}\mathcal{H}(z). \end{gathered} \end{aligned} $$
(6.4)

Proof

It is immediate from Definitions 6.4 and 6.7. □

Property 6.2

Let (G, M, I, D, J) be a regular formal decision context. For any \((A_{G},B_{G})\in \mathcal {G}^{C}\) and \((Y_{G},Z_{G})\in \mathcal {G}^{D}\), if A G ⊆ Y G, and A G, B G, Y G and Z G are nonempty, then we say that A G is associated with the class of Z G under the attribute set B G. It means that the object g can be represented by a single label z when \(A_{G}=\mathcal {H}\mathcal {F}(g)\) and \(Z_{G}=\mathcal {F}\mathcal {H}(z)\).

Proof

It is immediate from Definitions 6.3 and 6.8. □

Definition 6.9

Let (G, M, I, D, J) be a regular formal decision context and D 1, D 2, …, D l be nonempty and finite class sets of D, where D = D 1 ∪ D 2 ∪… ∪ D l and D r ∩ D j = ∅(1 ≤ r, j ≤ l, r ≠ j). We call \(G_{i}^{D}=G_{i}^{D_{1}}\cup G_{i}^{D_{2}}\cup \ldots \cup G_{i}^{D_{l}}\) the class-object set under the i-th cognitive state.

For brevity, we write \(G_{i}^{D}\) as G i and the corresponding set of G i is denoted by \(\left \{G_{i}\right \}=\bigcup\limits _{j=1}^{l}\big \{G_{i}^{D_{j}}\big \}\). Considering that the information will be updated by different classes, we initiate and learn concepts by different labels. For convenience, for any D j ⊆ D, the subclass-object sets \(G_{1}^{D_{j}},G_{2}^{D_{j}},\ldots ,G_{n}^{D_{j}}\) with \(G_{1}^{D_{j}}\subseteq G_{2}^{D_{j}}\subseteq \ldots \subseteq G_{n}^{D_{j}}\) are denoted by \(\big \{G_{t}^{D_{j}}\big \}\!\!\uparrow \).

Property 6.3

Let (G, M, I, D, J) be a regular formal decision context. Then we have

$$\displaystyle \begin{aligned} \left\{G_{t}^{D}\right\}\!\!\uparrow =\big\{G_{t}^{D_{1}}\big\}\!\!\uparrow \cup \big\{G_{t}^{D_{2}}\big\}\!\!\uparrow \cup \ldots \cup \big\{G_{t}^{D_{l}}\big\}\!\!\uparrow. \end{aligned} $$
(6.5)

Proof

It is immediate from Definitions 6.6 and 6.9. □

From Definitions 6.7 and 6.8, and Property 6.1, the initial concepts can be constructed by condition-concept space and class-concept space in a regular formal decision context. Then, an object can be associated with a single label by the interaction between condition-concept space and class-concept space from Property 6.2. Property 6.3 means that a cognitive state can be decomposed into some cognitive sub-states by different categories in a regular formal decision context. Therefore, hereinafter we only discuss the situation under a cognitive sub-state D j.

1.1.2.2 B. Concept-Cognitive Process

Considering that the information on the object set G and the attribute set M will be updated as time goes by in the real world, we discuss how the concept spaces are updated in a timely manner in a regular formal decision context.

Definition 6.10

Let (G, M, I, D, J) be a regular formal decision context, \(G_{i-1}^{D_{j}},G_{i}^{D_{j}}\) be two subclass-objects of \(\big \{G_{t}^{D_{j}}\big \}\!\!\uparrow \) and M i−1, M i be attribute sets of {M t}  . Denote \(\Delta G_{i-1}^{D_{j}}=G_{i}^{D_{j}}\!\!-\!\!G_{i-1}^{D_{j}},\Delta M_{i-1}=M_{i}\!\!-\!\!M_{i-1}\). Suppose

$$\displaystyle \begin{aligned} 1)&\ \mathcal{F}_{D_{j},i-1}^{M}\!\!:\!\!2^{G_{i-1}^{D_{j}}}\!\!\rightarrow\!\! 2^{M_{i-1}},\qquad \ \mathcal{H}_{D_{j},i-1}^{M}\!\!:\!\!2^{M_{i-1}}\!\!\rightarrow\!\! 2^{G_{i-1}^{D_{j}}},\\ 2)&\ \mathcal{F}_{D_{j},i-1}^{D}\!\!:\!\!2^{G_{i-1}^{D_{j}}}\!\!\rightarrow\!\! 2^{D}, \qquad \quad \ \ \mathcal{H}_{D_{j},i-1}^{D}\!\!:\!\!2^{D}\!\!\rightarrow\!\! 2^{G_{i-1}^{D_{j}}},\\ 3)&\ \mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}\!\!:\!\!2^{\Delta G_{i-1}^{D_{j}}}\!\!\rightarrow\!\! 2^{M_{i-1}},\quad \ \mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}\!\!:\!\!2^{M_{i-1}}\!\!\rightarrow\!\! 2^{\Delta G_{i-1}^{D_{j}}},\\ 4)&\ \mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}\!\!:\!\!2^{\Delta G_{i-1}^{D_{j}}}\!\!\rightarrow\!\! 2^{D}, \qquad \ \mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}\!\!:\!\!2^{D}\!\!\rightarrow\!\! 2^{\Delta G_{i-1}^{D_{j}}},\\ 5)&\ \mathcal{F}_{D_{j},\Delta M_{i-1}}^{M}\!\!:\!\!2^{G_{i}^{D_{j}}}\!\!\rightarrow\!\! 2^{\Delta M_{i-1}}, \quad \ \mathcal{H}_{D_{j},\Delta M_{i-1}}^{M}\!\!:\!\!2^{\Delta M_{i-1}}\!\!\rightarrow\!\! 2^{G_{i}^{D_{j}}},\\ 6)&\ \mathcal{F}_{D_{j},i}^{M}\!\!:\!\!2^{G_{i}^{D_{j}}}\!\!\rightarrow\!\! 2^{M_{i}},\qquad \qquad \ \mathcal{H}_{D_{j},i}^{M}\!\!:\!\!2^{M_{i}}\!\!\rightarrow\!\! 2^{G_{i}^{D_{j}}},\\ 7)&\ \mathcal{F}_{D_{j},i}^{D}\!\!:\!\!2^{G_{i}^{D_{j}}}\!\!\rightarrow\!\! 2^{D},\qquad \qquad \ \ \mathcal{H}_{D_{j},i}^{D}\!\!:\!\!2^{D}\!\!\rightarrow\!\! 2^{G_{i}^{D_{j}}} \end{aligned} $$

are seven pairs of cognitive operators in a regular formal decision context satisfying the following properties:

$$\displaystyle \begin{aligned} &\!\!\!\!\mathcal{F}_{D_{j},i}^{M}(g)= \begin{cases} \mathcal{F}_{D_{j},i-1}^{M}(g)\cup \mathcal{F}_{D_{j},\Delta M_{i-1}}^{M}(g),&\!\!\!\!\!\!\mbox{if}\ g \in G_{i-1}^{D_{j}},\\ \mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}(g)\cup \mathcal{F}_{D_{j},\Delta M_{i-1}}^{M}(g), &\!\!\!\!\!\mbox{otherwise}, \end{cases} \end{aligned} $$
(6.6)
$$\displaystyle \begin{aligned} &\!\!\!\!\mathcal{H}_{D_{j},i}^{M}(m)= \begin{cases} \mathcal{H}_{D_{j},i-1}^{M}(m)\cup \mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}(m),&\!\!\!\!\mbox{if}\ m \in M_{i-1},\\ \mathcal{H}_{D_{j},\Delta M_{i-1}}^{M}(m), &\!\!\!\!\mbox{otherwise}, \end{cases} \end{aligned} $$
(6.7)
$$\displaystyle \begin{aligned} &\!\!\!\!\mathcal{F}_{D_{j},i}^{D}(y)= \begin{cases} \mathcal{F}_{D_{j},i-1}^{D}(y),&\mbox{if}\ y\in G_{i-1}^{D_{j}},\\ \mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}(y),&\mbox{otherwise}, \end{cases} \end{aligned} $$
(6.8)
$$\displaystyle \begin{aligned} &\!\!\!\!\mathcal{H}_{D_{j},i}^{D}(z)=\mathcal{H}_{D_{j},i-1}^{D}(z)\cup \mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}(z), \ \mbox{if}\ z\in D, \end{aligned} $$
(6.9)

where \(\mathcal {F}_{D_{j}, \Delta G_{i-1}^{D_{j}}}^{M}(g), \mathcal {H}_{D_{j}, \Delta G_{i-1}^{D_{j}}}^{M}(m)\) and \(\mathcal {H}_{D_{j}, \Delta G_{i-1}^{D_{j}}}^{D}(z)\) are set to be empty when \(\Delta G_{i-1}^{D_{j}}=\emptyset \), and \(\mathcal {F}_{D_{j}, \Delta M_{i-1}}^{M}(g)\) and \(\mathcal {H}_{D_{j}, \Delta M_{i-1}}^{M}(m)\) are set to be empty when ΔM i−1 = ∅.

Then we say that \(\mathcal {F}_{D_{j},i}^{M},\mathcal {F}_{D_{j},i}^{D}\) and \(\mathcal {H}_{D_{j},i}^{M},\mathcal {H}_{D_{j},i}^{D}\) are respectively extended cognitive operators of \(\mathcal {F}_{D_{j},i-1}^{M},\mathcal {F}_{D_{j},i-1}^{D}\) and \(\mathcal {H}_{D_{j},i-1}^{M},\mathcal {H}_{D_{j},i-1}^{D}\) with the newly input data \(\Delta G_{i-1}^{D_{j}}\) and ΔM i−1. For convenience, cognitive operators \(\mathcal {F}_{D,i}^{M}\) and \(\mathcal {H}_{D,i}^{M}\) denote the combination of \(\mathcal {F}_{D_{1},i}^{M},\mathcal {F}_{D_{2},i}^{M},\ldots ,\mathcal {F}_{D_{l},i}^{M}\) and \(\mathcal {H}_{D_{1},i}^{M},\mathcal {H}_{D_{2},i}^{M},\ldots ,\mathcal {H}_{D_{l},i}^{M}\), respectively. Similarly, we can define \(\mathcal {F}_{D,i}^{D}\) and \(\mathcal {H}_{D,i}^{D}\).

Meanwhile, for any D j ⊆ D, \(\mathcal {G}_{\mathcal {F}_{D_{j},i-1}^{M},\mathcal {H}_{D_{j},i-1}^{M}}^C\) means subcondition-concept space under cognitive operators \(\mathcal {F}_{D_{j},i-1}^{M}\) and \(\mathcal {H}_{D_{j},i-1}^{M}\), and \(\mathcal {G}_{\mathcal {F}_{D,i-1}^{M},\mathcal {H}_{D,i-1}^{M}}^C\) is called as condition-concept space under cognitive operators \(\mathcal {F}_{D,i-1}^{M}\) and \(\mathcal {H}_{D,i-1}^{M}\). In a similar manner, we can define \(\mathcal {G}_{\mathcal {F}_{D,i-1}^{D},\mathcal {H}_{D,i-1}^{D}}^D\) and \(\mathcal {G}_{\mathcal {F}_{D_{j},i-1}^{D},\mathcal {H}_{D_{j},i-1}^{D}}^D\). In \(\mathcal {G}_{\mathcal {F}_{D,i-1}^{M},\mathcal {H}_{D,i-1}^{M}}^C\), we can obtain the k-th granular concept \(\big (A_{G,k}^{D_{j}},B_{G,k}^{D_{j}}\big )\) from \(\mathcal {G}_{\mathcal {F}_{D,i-1}^{M},\mathcal {H}_{D,i-1}^{M}}^C\) with a class set D j. Moreover, for dynamic information \(\Delta G_{i-1}^{D_{j}}\), we write \(\mathcal {G}_{\mathcal {F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M},\mathcal {H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}}^C\) and \(\mathcal {G}_{\mathcal {F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D},\mathcal {H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}}^D\) as \(\mathcal {G}_{\Delta G_{i-1}^{D_{j}}}^{C}\) and \(\mathcal {G}_{\Delta G_{i-1}^{D_{j}}}^{D} \ \Big (\mbox{Similarly}, \mathcal {G}_{\Delta M_{i-1}}^{C} \mbox{and}\ \mathcal {G}_{\Delta M_{i-1}}^{D} \mbox{for}\ \Delta M_{i-1} \Big )\) under operators \(\mathcal {F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M},\mathcal {H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}\) and \(\mathcal {F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D},\mathcal {H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}\) \(\Big ( \mathcal {F}_{D_{j},\Delta M_{i-1}}^{M},\mathcal {H}_{D_{j},\Delta M_{i-1}}^{M}\ \) \(\mbox{and}\ \mathcal {F}_{D_{j},\Delta M_{i-1}}^{D},\mathcal {H}_{D_{j},\Delta M_{i-1}}^{D} \Big )\).

In theory, although we can update concepts by objects and attributes simultaneously, we are mainly interested in new object information, because the attributes can be regarded as relatively stable under certain conditions.

Theorem 6.1

Let \(G_{i}^{D_{j}}\) be a subclass-object set under a set D j and \(\big ( \mathcal {G}_{\mathcal {F}_{D_{j},i-1},\mathcal {H}_{D_{j},i-1}}\), \( \mathcal {F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M},\mathcal {F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}, \mathcal {H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}, \mathcal {H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D} \big )\) be an object-oriented cognitive computing state, where \(\mathcal {G}_{\mathcal {F}_{D_{j},i-1},\mathcal {H}_{D_{j},i-1}}\) is the concept space under cognitive operators \(\mathcal {F}_{D_{j},i-1}\) and \(\mathcal {H}_{D_{j},i-1}\) . Then the following statements hold:

$$\displaystyle \begin{aligned} &1)\ \mathit{\mbox{For any}}\ g\in G_{i}^{D_{j}}, \mathit{\mbox{if}}\ g\in G_{i-1}^{D_{j}}, \mathit{\mbox{then}}\\ &\big(\mathcal{H}_{D_{j},i}^{M}\mathcal{F}_{D_{j},i}^{M}(g),\mathcal{F}_{D_{j},i}^{M}(g)\big)=\big(\mathcal{H}_{D_{j},i-1}^{M}\mathcal{F}_{D_{j},i-1}^{M}(g)\cup \mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M} \mathcal{F}_{D_{j},i-1}^{M}(g),\\ &\qquad \qquad \qquad \qquad \qquad \quad \ \ \mathcal{F}_{D_{j},i-1}^{M}(g) \big);\\ &\mathit{\mbox{otherwise}},\\ &\big(\mathcal{H}_{D_{j},i}^{M}\mathcal{F}_{D_{j},i}^{M}(g),\mathcal{F}_{D_{j},i}^{M}(g) \big)=\big( \mathcal{H}_{D_{j},i-1}^{M}\mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}(g) \cup \mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}\mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}(g),\\ &\qquad \qquad \qquad \qquad \qquad \quad \ \ \mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}(g) \big).\\ \end{aligned} $$
$$\displaystyle \begin{aligned} &2)\ \mathit{\mbox{For any}}\ m\in M_{i-1}, \mathit{\mbox{we have}}\\ &\big(\mathcal{H}_{D_{j},i}^{M}(m),\mathcal{F}_{D_{j},i}^{M}\mathcal{H}_{D_{j},i}^{M}(m) \big)=\big(\mathcal{H}_{D_{j},i-1}^{M}(m) \cup \mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}(m), \mathcal{F}_{D_{j},i-1}^{M}\mathcal{H}_{D_{j},i-1}^{M}(m)\cap \\ &\qquad \qquad \qquad \qquad \qquad \qquad\ \mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}\mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{M}(m) \big).\\ &3)\ \mathit{\mbox{For any}}\ y\in G_{i}^{D_{j}}, \mathit{\mbox{if}}\ y\in G_{i-1}^{D_{j}}, \mathit{\mbox{then}}\\ &\big(\mathcal{H}_{D_{j},i}^{D}\mathcal{F}_{D_{j},i}^{D}(y),\mathcal{F}_{D_{j},i}^{D}(y)\big)=\big( \mathcal{H}_{D_{j},i-1}^{D}\mathcal{F}_{D_{j},i-1}^{D}(y) \cup \mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}\mathcal{F}_{D_{j},i-1}^{D}(y),\\ &\qquad \qquad \qquad \qquad \qquad \quad \ \ \ \mathcal{F}_{D_{j},i-1}^{D}(y)\big);\\ &\mathit{\mbox{otherwise}},\\ &\big(\mathcal{H}_{D_{j},i}^{D}\mathcal{F}_{D_{j},i}^{D}(y),\mathcal{F}_{D_{j},i}^{D}(y) \big)=\big( \mathcal{H}_{D_{j},i-1}^{D}\mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}(y) \cup \mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}\mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}(y), \\ &\qquad \qquad \qquad \qquad \qquad \quad \ \ \mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}(y) \big).\\ &4)\ \mathit{\mbox{For any}}\ z\in D, \mathit{\mbox{we obtain}}\\ &\big(\mathcal{H}_{D_{j},i}^{D}(z),\!\mathcal{F}_{D_{j},i}^{D}\mathcal{H}_{D_{j},i}^{D}(z) \big)=\big( \mathcal{H}_{D_{j},i-1}^{D}(z) \cup \mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}(z), \mathcal{F}_{D_{j},i-1}^{D}\mathcal{H}_{D_{j},i-1}^{D}(z)\cap \\ &\qquad \qquad \qquad \qquad \qquad \quad \ \ \mathcal{F}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}\mathcal{H}_{D_{j},\Delta G_{i-1}^{D_{j}}}^{D}(z) \big).\\ \end{aligned} $$

Proof

The proof of Theorem 6.1 can be found in the original paper [58]. □

From Theorem 6.1, we observe that the i-th concept space can be constructed under cognitive operators \(\mathcal {F}_{D_{j},i-1}\) and \(\mathcal {H}_{D_{j},i-1}\), and the concept space \(\mathcal {G}_{\mathcal {F}_{D_{j},i},\mathcal {H}_{D_{j},i}}\) can be obtained by \(\mathcal {G}_{\mathcal {F}_{D_{j},i-1},\mathcal {H}_{D_{j},i-1}}\) with the newly input data \(\Delta G_{i-1}^{D_{j}}\). This means that we can obtain concepts based on the past concepts rather than reconstructing them from the beginning.

However, in the previous discussions, we still do not know which class-object set \(G_{i-1}^{D_{*}}\) should be theoretically updated with \(\Delta G_{i-1}^{D_{j}}\). In other words, although we obtain class-object set \(G_{i-1}^{D_{j}}\) which will be actually updated by \(\Delta G_{i-1}^{D_{j}}\), we are not sure if the updated class-object set in the model is in accordance with \(G_{i-1}^{D_{j}}\). Thus, we will further discuss the relationship between \(G_{i-1}^{D_{j}}\) and \(G_{i-1}^{D_{*}}\).

Definition 6.11

Let \((\mathcal {H}\mathcal {F}(g),\mathcal {F}(g))\) be a granular concept, for any \((A_{G,e},B_{G,e})\in \mathcal {G}^{C}\), where \(e \in \{ 1,2,\ldots ,|\mathcal {G}^{C}| \}\). Then we can define concept-similarity degree (CS) as follows:

$$\displaystyle \begin{aligned} &\theta_{CS}=CS(\mathcal{F}(g),B_{G,e})=\frac{W_{p}\cdot M^{T}}{|\mathcal{F}(g) \cup B_{G,e}|}, \end{aligned} $$
(6.10)

where M T is the transpose of the vector M, and W p = (w 1, w 2, …, w m) is a cognitive weight vector that is associated with an attribute vector M = (m 1, m 2, …, m m) consisting of (1) the elements from \(\mathcal {F}(g) \cap B_{G,e}\) which are all set to be 1, and (2) the elements from \(M - (\mathcal {F}(g) \cap B_{G,e})\) which are all set to be 0.
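Since M is the 0/1 indicator vector of \(\mathcal {F}(g) \cap B_{G,e}\), the product W p ⋅ M T simply sums the cognitive weights of the shared attributes. A minimal sketch of Eq. (6.10), with illustrative attribute names and weights:

```python
def concept_similarity(Fg, B, weights):
    """theta_CS = (W_p . M^T) / |F(g) union B|  (Eq. (6.10)):
    sum the weights of shared attributes, normalize by the union size."""
    dot = sum(w for m, w in weights.items() if m in (Fg & B))
    return dot / len(Fg | B)

Fg, B = {"m1", "m2"}, {"m1", "m3"}
unit = {"m1": 1.0, "m2": 1.0, "m3": 1.0}
print(concept_similarity(Fg, B, unit))  # 0.3333333333333333
```

With unit weights, θ CS reduces to the Jaccard index \(|\mathcal {F}(g) \cap B_{G,e}| / |\mathcal {F}(g) \cup B_{G,e}|\); non-uniform weights let the learner emphasize discriminative attributes.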

Let E be training times. For any t ∈ E, the cognitive weight vector of the t-th training is denoted by \(W_{i,p}^{t}=(w_{i,1}^{t},w_{i,2}^{t},\ldots ,w_{i,m}^{t})\). Then we denote

$$\displaystyle \begin{aligned} \left[ \begin{array}{c} W_{1,p}^{t}\\ \vdots \\ W_{n,p}^{t} \end{array} \right]= \left[ \begin{array}{ccc} w_{1,1}^{t} &\cdots &w_{1,m}^{t}\\ \vdots &\cdots &\vdots \\ w_{n,1}^{t} &\cdots &w_{n,m}^{t} \end{array} \right], \end{aligned} $$
(6.11)

where \(n = \big |\bigcup\limits _{i=1}^{n}\{G_{i}\} \big |\). Our purpose is to obtain an optimal cognitive weight vector \(W_{n,p}^{t}\) by computing concept-similarity degree vectors.

Definition 6.12

Let {G i−1} be a class-object set under \(G_{i-1}^{D}\), \(\Delta G_{i-1}^{D_{*}}\) be a new object set under D , and \(G_{i-1}^{D_{j}}\) and \(G_{i-1}^{D_{r}}\) be class-object sets under class sets D j and D r (D j ∩ D r = ∅), respectively. For any granular concept \(\big (A_{G,e}^{D_{j}},B_{G,e}^{D_{j}}\big ) \in \mathcal {G}_{\mathcal {F}_{D_{j},i-1}^{M},\mathcal {H}_{D_{j},i-1}^{M}}^C\) and a new granular concept \(\big ( \mathcal {H}_{D_{*},\Delta G_{i-1}^{D_{*}}}^{M}\mathcal {F}_{D_{*},\Delta G_{i-1}^{D_{*}}}^{M}(g),\mathcal {F}_{D_{*},\Delta G_{i-1}^{D_{*}}}^{M}(g) \big )\), the degree of similarity between the concepts is defined as \(CS_{i-1}^{D_{j}}=CS\big ( B_{G,e}^{D_{j}},\mathcal {F}_{D_{*},\Delta G_{i-1}^{D_{*}}}^{M}(g) \big )\). Then, we denote

$$\displaystyle \begin{aligned} &MCS^{D_{j}}=\max\limits_{e=1}^{n}\big( CS_{i-1}^{D_{j}} \big)=\max\limits_{e=1}^{n}\Big( CS\big( B_{G,e}^{D_{j}},\mathcal{F}_{D_{*},\Delta G_{i-1}^{D_{*}}}^{M} (g) \big) \Big), \end{aligned} $$

where \(n=\Big | \mathcal {G}_{\mathcal {F}_{D_{j},i-1}^{M},\mathcal {H}_{D_{j},i-1}^{M}}^C \Big |\). Then, we further denote

$$\displaystyle \begin{aligned} \begin{gathered} MMCS^{D_{j}}=\max\limits_{j=1}^{l}\big( MCS^{D_{j}} \big). \end{gathered} \end{aligned} $$
(6.12)

From (6.12), we can determine which subclass-object set \(G_{i-1}^{D_{j}}\) in the class-object set \(G_{i-1}^{D}\) should be updated. Therefore, if D ∗ = D j, the theoretically updated subclass-object set \(G_{i-1}^{D_{*}}\) is in accordance with the actually updated subclass-object set \(G_{i-1}^{D_{j}}\). Otherwise, we should adjust the cognitive weight vectors as follows.

$$\displaystyle \begin{aligned} &w_{i}^{t}\leftarrow w_{i}^{t} \pm \Delta w_{i}^{t}, \\ &\Delta w_{i}^{t}=activationFunction(\eta w_{i}^{t}), \end{aligned} $$
(6.13)

where the operator +  is adopted when the attributes are from \(B_{G,e}^{D_{j}} \bigcap \mathcal {F}_{D_{*}\Delta G_{i-1}^{D_{*}}}^{M}(g)\) and the operator − is used for the elements from \(B_{G,e}^{D_{r}} \bigcap \mathcal {F}_{D_{*}\Delta G_{i-1}^{D_{*}}}^{M}(g)\), and \(activationFunction(\eta w_{i}^{t})=\frac {exp(\eta w_{i}^{t})-exp(-\eta w_{i}^{t})}{exp(\eta w_{i}^{t})+exp(-\eta w_{i}^{t})}\) (i.e., the hyperbolic tangent) with the learning rate η ∈ (0, 1).
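The activation function in Eq. (6.13) is exactly the hyperbolic tangent, so a sketch of the update can use `math.tanh` directly; the function name and arguments below are illustrative:

```python
import math

def adjust_weight(w, eta, reinforce):
    """Eq. (6.13): return w +/- delta_w with delta_w = tanh(eta * w).
    '+' reinforces attributes supporting the correct class,
    '-' inhibits attributes supporting the wrongly predicted class."""
    delta = math.tanh(eta * w)
    return w + delta if reinforce else w - delta
```

Because |tanh(x)| < 1 for all x, each adjustment is bounded, and a small η ∈ (0, 1) keeps the per-step change gentle.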

1.1.3 Proposed Model

In this section, based on the above discussion, we put forward a CCLM with dynamic learning, which achieves good performance in incremental learning and classification tasks.

1.1.3.1 A. Initial Concept Learning

We split the raw data into training data \(\overline {G}\) and testing data \(\overline {\overline {G}}\). For the training data, let \(\{ \overline {G}\}\) be the set of object sets \(\overline {G}_{1},\overline {G}_{2},\ldots ,\overline {G}_{n}\) with \(\overline {G}_{i}\cap \overline {G}_{j}=\emptyset \ (i\neq j)\); we denote

$$\displaystyle \begin{aligned} \{\overline{G}\}=\bigcup_{i=1}^{n}\{\overline{G}_{i}\}. \end{aligned} $$
(6.14)

Here, \(\overline {G}_{1}\) is the initial training data, and the rest of the training data, \(\bigcup\limits _{i=2}^{n}\{\overline {G}_{i}\}\), is used for concept cognition.

From Definitions 6.7 and 6.8, the initial concept learning consists of two parts: constructing condition-concept space and decision-concept space. The details are shown in Algorithm 6.1, and its time complexity is \(O\big ( |\{\overline {G}_{1}\}|(|M|+|D|+|\overline {G}_{1}^{D_{j}}|) \big )\).

Algorithm 6.1 Initial concept learning

1.1.3.2 B. Concept-Cognitive Process

Let E, err 0, W, AW, IW be the training epochs, learning error rate, cognitive weight vector, active weight vector and inhibited weight vector, respectively. It should be pointed out that AW and IW are used to enhance and weaken the corresponding attributes, respectively. Based on the theoretical foundation presented above, the concept-cognitive process can be briefly described as follows:

Firstly, construct a conditional granular concept \(\big ( A_{G,k}^{D_{*}},B_{G,k}^{D_{*}} \big )\) and a decision granular concept \(\big ( Y_{G,k}^{D_{*}},Z_{G,k}^{D_{*}} \big )\).

Secondly, for a new concept \(\big ( A_{G,k}^{D_{*}},B_{G,k}^{D_{*}} \big )\), we compute its concept-similarity degree with each granular concept \(\big ( A_{G,e}^{D_{j}},B_{G,e}^{D_{j}} \big )\) from \(\mathcal {G}_{\mathcal {F}_{D,i-1}^{M},\mathcal {H}_{D,i-1}^{M}}^C\).

Thirdly, if the predicted label is not in accordance with the actual label, the weight vectors W, AW and IW will be updated.

Finally, for the first training, we update the condition-concept space and decision-concept space with the dynamic concepts \(\big ( A_{G,k}^{D_{*}},B_{G,k}^{D_{*}} \big )\) and \(\big ( Y_{G,k}^{D_{*}},Z_{G,k}^{D_{*}} \big )\), respectively. Using a recursive approach, we can obtain a final cognitive weight vector \(W_{n,p}^{E}\) and a final concept space (i.e., the condition-concept space \(\mathcal {G}_{\mathcal {F}_{D,n}^{M},\mathcal {H}_{D,n}^{M}}^C\) and the decision-concept space \(\mathcal {G}_{\mathcal {F}_{D,n}^{D},\mathcal {H}_{D,n}^{D}}^D\)).
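The steps above can be condensed into a single incremental update. Everything in this sketch (dict-based concept spaces, unit default weights, the simple reinforcement rule) is an illustrative simplification of Algorithm 6.2, not the model's actual implementation:

```python
import math

def cognitive_step(concept_spaces, g_intent, true_class, weights, eta=0.1):
    """One incremental update: predict via maximum concept similarity,
    adjust weights on a wrong prediction (cf. Eq. (6.13)), then store
    the new concept into the class's concept space."""
    def cs(B):  # weighted similarity against an existing intent B (cf. Eq. (6.10))
        return sum(weights.get(m, 1.0) for m in g_intent & B) / len(g_intent | B)

    # Maximum similarity per class, then the maximum over classes (cf. Eq. (6.12)).
    scores = {c: max((cs(B) for B in intents), default=0.0)
              for c, intents in concept_spaces.items()}
    predicted = max(scores, key=scores.get)

    if predicted != true_class:  # reinforce attributes of the true class
        for m in g_intent:
            weights[m] = weights.get(m, 1.0) + math.tanh(eta * weights.get(m, 1.0))

    # Update the concept space with the newly formed concept.
    concept_spaces.setdefault(true_class, []).append(g_intent)
    return predicted

spaces = {"d1": [{"m1", "m2"}], "d2": [{"m3"}]}
w = {}
print(cognitive_step(spaces, {"m1"}, "d1", w))  # d1
```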

The details of concept-cognitive process are shown in Algorithm 6.2. Note that params[1], params[2] and params[3] are \(\Big (\big ( A_{G,k}^{D_{*}},B_{G,k}^{D_{*}} \big ), \mathcal {G}_{\mathcal {F}_{D,i-1}^{M},\mathcal {H}_{D,i-1}^{M}}^C\), \(\mathcal {G}_{\mathcal {F}_{D,i-1}^{D},\mathcal {H}_{D,i-1}^{D}}^D\), \( W_{i-1,p}^{t-1} \Big ),\Big (\eta ,j,type, B_{G,k}^{D_{type}}, \theta L_{max}[|D|], W_{i-1,p}^{t-1}, AW_{i-1,p}^{t-1}, IW_{i-1,p}^{t-1} \Big )\)and \(\Big ( \mathcal {G}_{\mathcal {F}_{D,i-1}^{M},\mathcal {H}_{D,i-1}^{M}}^C\), \( \mathcal {G}_{\mathcal {F}_{D,i-1}^{D},\mathcal {H}_{D,i-1}^{D}}^D, \big ( A_{G,k}^{D_{type}}, B_{G,k}^{D_{type}} \big ),\big (Y_{G,k}^{D_{type}}, Z_{G,k}^{D_{type}} \big )\Big )\), respectively.

Algorithm 6.2 Concept-cognitive process

Now, we analyze the time complexity of Algorithm 6.2. Running Step 18 takes O(1) because objects are updated one by one in CCLM. Step 20 invokes Algorithm 6.3, whose running time is determined by two for loops. Thus, running Steps 18–26 takes \(O\Big ( \big |\mathcal {G}_{\mathcal {F}_{D,i-1}^{M}\!,\mathcal {H}_{D,i-1}^{M}}^C \big |\big |\mathcal {G}_{\mathcal {F}_{D,i-1}^{D}\!,\mathcal {H}_{D,i-1}^{D}}^D\big |\Big )\), where \(\big |\mathcal {G}_{\mathcal {F}_{D,i-1}^{D}\!,\mathcal {H}_{D,i-1}^{D}}^D\big |\) equals |D| and is often very small. Steps 27–32 call Algorithms 6.4 and 6.5, so their time complexity is O(|D|((|activeSet| + |inhibitSet|) + (|M| + |D|))). To sum up, the time complexity of Algorithm 6.2 is \(O\Big ( P\big |\bigcup\limits _{i=2}^{n}\{ \overline {G}_{i} \}\big | \) \( \big ( \big | \overline {G}_{i}^{D_{*}} \big |+\big |\mathcal {G}_{\mathcal {F}_{D,i-1}^{M}\!,\mathcal {H}_{D,i-1}^{M}}^C \big ||D|+Q \big ) \Big )\) with \(P=max\{ E,E_{err_{0}}\}\) and \(Q=(|M|+|D|)(|D|+1)+|D|(|activeSet|+|inhibitSet|)\), where E is the number of training epochs and \(E_{err_{0}}\) is the number of runs with respect to err 0.

Algorithm 6.3 Concept-similarity degree

Algorithm 6.4 Adjust weight

Algorithm 6.5 Update concept space

1.1.3.3 Overall Procedure and Concept Prediction

Figure 6.1 shows the overall procedure of CCLM, which includes three stages: initial concept generation, the concept-cognitive process, and concept prediction. Suppose there are three classes to predict. The initial concept generation stage generates the concept space by mapping objects into concepts, and the second stage then updates the concept space via the concept-similarity degree with labeled data.

Fig. 6.1

Illustration of overall procedure for CCLM. Suppose there are three classes to predict, and the maximum class vector is obtained by concept-similarity degree

In the concept prediction stage, for any test instance, the concept-similarity degree is again used to compute the similarity, and the final prediction is completed by the sum of the maximum class vector, as shown on the right of Fig. 6.1. Note that, unlike in the second stage, the concept space is not updated in the third stage.

Based on the final concept space \(\mathcal {G}_{\mathcal {F}_{D,n}^{M}\!,\mathcal {H}_{D,n}^{M}}^C, \mathcal {G}_{\mathcal {F}_{D,n}^{D}\!,\mathcal {H}_{D,n}^{D}}^D\), and the final weight vector \(W_{n,p}^{E}\), we can make predictions in \(\overline {\overline {G}}\). The details are described in Algorithm 6.6. Considering that running Step 6 invokes Algorithm 6.3, it is easy to verify that the time complexity of Algorithm 6.6 is \(O\big ( \big |\overline {\overline {G}}\big | \big |\mathcal {G}_{\mathcal {F}_{D,i-1}^{M}\!,\mathcal {H}_{D,i-1}^{M}}^C\big | \big |D\big | \big )\).

Algorithm 6.6 Concept prediction

1.2 Concurrent Concept-Cognitive Learning Model for Classification

In this subsection, we discuss the design of a new theoretical framework for concurrent computing, which comprises three aspects: initial concurrent concept learning, the concurrent concept-cognitive process, and the concept generalization process.

1.2.1 Initial Concurrent Concept Learning in C3LM

In the real world, not all methods can be made concurrent; this often depends on their separability. To guarantee the concurrency of C3LM in theory, we need the following definitions and propositions.

Definition 6.13

Let (G, M, I, D, J) be a regular formal decision context. Suppose that \(D_{1}, D_{2}, \ldots , D_{K}\) is a partition of D by class labels, and let \(G= G^{D_{1}} \cup G^{D_{2}} \cup \ldots \cup G^{D_{K}}\). Then, we say that \(G^{D_{k}}\ (k\in \{1,2,\ldots ,K\})\) is a subclass-object set. For the sake of brevity, hereinafter we write \(G^{D_{k}}\) as \(G^{k}\).

Definition 6.13 indicates that an object set G can be decomposed into several subclass-object sets in a regular formal decision context. Moreover, we only consider updates induced by newly input objects, as attributes can be taken as relatively stable in real life. Therefore, in the following, we discuss the scenario of a subclass-object set \(G^{k}\).
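As a toy illustration, the decomposition of Definition 6.13 amounts to grouping objects by their class label. The following sketch (all names and data are hypothetical) builds the subclass-object sets \(G^{k}\):

```python
from collections import defaultdict

def subclass_object_sets(objects, labels):
    """Partition an object set G into subclass-object sets G^k
    (Definition 6.13) by grouping objects with the same class label."""
    partition = defaultdict(set)
    for g, k in zip(objects, labels):
        partition[k].add(g)
    return dict(partition)

# Five toy objects labeled with two classes from D.
G = ["g1", "g2", "g3", "g4", "g5"]
labels = ["d1", "d2", "d1", "d2", "d1"]
print(subclass_object_sets(G, labels))
```

Each subclass-object set can then be processed independently, which is what makes the concurrent treatment below possible.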

Let \(G^{k}\) be a subclass-object set, and let M and D be attribute sets. The set-valued mappings \(\mathcal {F}^{k}:2^{G^{k}}\rightarrow 2^{M},\mathcal {H}^{k}:2^{M}\rightarrow 2^{G^{k}}\) and \(\widetilde {\mathcal {F}}^{k}:2^{G^{k}}\rightarrow 2^{D},\widetilde {\mathcal {H}}^{k}:2^{D}\rightarrow 2^{G^{k}}\) are respectively referred to as the conditional and decision cognitive operators with the subclass-object set \(G^{k}\) when no confusion exists.

Definition 6.14

Let \(G^{k}_{1},G^{k}_{2},\ldots ,G^{k}_{n}\) be a partition of an object set G k. If the following cognitive operators:

$$\displaystyle \begin{aligned} &\mathcal{F}^{k}_{j}:2^{G^{k}_{j}}\rightarrow 2^{M}, \qquad\mathcal{H}^{k}_{j}:2^{M}\rightarrow 2^{G^{k}_{j}}, j=1,2,\ldots,n, \\ &\mathcal{F}^{k}:2^{G^{k}}\rightarrow 2^{M}, \qquad\mathcal{H}^{k}:2^{M}\rightarrow 2^{G^{k}} \end{aligned}$$

satisfy \(\mathcal {F}^{k}(g)=\mathcal {F}^{k}_{j}(g)\), where \(g\in G^{k}_{j}\), we say that \(\mathcal {H}\mathcal {S}_{\mathcal {F}^{k}\mathcal {H}^{k}}=(\mathcal {F}^{k}_{1},\ldots ,\mathcal {F}^{k}_{n}; \mathcal {H}^{k}_{1},\ldots \), \(\mathcal {H}^{k}_{n})\) is a conditional horizontal partition state.

Proposition 6.1

Let \(\mathcal {H}\mathcal {S}_{\mathcal {F}^{k}\mathcal {H}^{k}}=(\mathcal {F}^{k}_{1},\ldots ,\mathcal {F}^{k}_{n}; \mathcal {H}^{k}_{1},\ldots ,\mathcal {H}^{k}_{n})\) be a conditional horizontal partition state. For any \(g \in G^{k}_{j_{1}}\ (j_{1} \in \{1,2,\ldots ,n\})\), if there exist objects \(g_{1},g_{2},\ldots ,g_{n} \in G^{k}_{j_{2}}\ (j_{2} \in \{1,2,\ldots ,n\})\) such that \(\mathcal {F}^{k}_{j_{1}}(g) \subseteq \mathcal {F}^{k}_{j_{2}}(g_{i})\ (i=1,2,\ldots ,n)\), we have

$$\displaystyle \begin{aligned} (\mathcal{H}^{k}\mathcal{F}^{k}(g),\mathcal{F}^{k}(g))=\Big(\big\{g\cup (\mathop{\cup}\limits_{i=1}^{n} g_{i})\big\},\mathcal{F}^{k}_{j_{1}}(g)\Big); \end{aligned} $$
(6.15)

otherwise,

$$\displaystyle \begin{aligned} (\mathcal{H}^{k}\mathcal{F}^{k}(g),\mathcal{F}^{k}(g))=(\{g\},\mathcal{F}^{k}_{j_{1}}(g)). \end{aligned} $$
(6.16)

Proof

The proof of Proposition 6.1 can be found in the original paper [60]. □

In fact, from the perspective of objects, Definition 6.14 and Proposition 6.1 demonstrate that the separability holds for C3LM in the conditional formal context (G, M, I). Analogously, we can determine that the separability also holds for C3LM in the decision formal context (G, D, J) under the decision cognitive operators \(\widetilde {\mathcal {F}}^{k}\) and \(\widetilde {\mathcal {H}}^{k}\).
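The cognitive operators can be made concrete on a small binary formal context. The sketch below (toy context, hypothetical names) implements \(\mathcal {F}\) and \(\mathcal {H}\) as the usual derivation operators and checks the horizontal-separability property of Definition 6.14, namely that the intent of an object computed on a partition block agrees with the intent computed on the whole context:

```python
def intent(objs, context):
    """F: the attributes shared by every object in objs."""
    sets = [context[g] for g in objs]
    return set.intersection(*sets) if sets else set()

def extent(attrs, context):
    """H: the objects possessing every attribute in attrs."""
    return {g for g, ms in context.items() if attrs <= ms}

# Toy conditional formal context (G, M, I), object -> attribute set.
I = {"g1": {"m1", "m2"}, "g2": {"m2", "m3"}, "g3": {"m1", "m2", "m3"}}

# Horizontal separability (Definition 6.14): the intent of g1 is the
# same whether computed on the block {g1, g2} or on the whole context,
# because the attribute set M is shared by all blocks.
block = {g: I[g] for g in ("g1", "g2")}
assert intent({"g1"}, block) == intent({"g1"}, I)

# The pair (HF(g1), F(g1)) is a granular concept.
print(extent(intent({"g1"}, I), I), intent({"g1"}, I))
```

The extent computation \(\mathcal {H}\mathcal {F}(g)\), by contrast, must see all blocks, which is exactly what Proposition 6.1 handles when joining partial results.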

Definition 6.15

Let \(M_{1}, M_{2}, \ldots , M_{d}\) be a partition of M. For any \(G^{k} \subseteq G\), if the following cognitive operators:

$$\displaystyle \begin{aligned} &\mathcal{F}^{k}_{j}:2^{G^{k}}\rightarrow 2^{M_{j}}, \qquad\mathcal{H}^{k}_{j}:2^{M_{j}}\rightarrow 2^{G^{k}}, j=1,2,\ldots,d, \\ &\mathcal{F}^{k}:2^{G^{k}}\rightarrow 2^{M}, \qquad \, \mathcal{H}^{k}:2^{M}\rightarrow 2^{G^{k}} \end{aligned} $$

satisfy \(\mathcal {F}^{k}\mathcal {H}^{k}(m)=\bigcup\limits _{j=1}^{d}\mathcal {F}^{k}_{j}\mathcal {H}^{k}(m)\) where m ∈ M, we say that \(\mathcal {V}\mathcal {S}_{\mathcal {F}^{k}\mathcal {H}^{k}}=(\mathcal {H}^{k}_{1},\ldots ,\mathcal {H}^{k}_{d}; \mathcal {F}^{k}_{1},\ldots ,\mathcal {F}^{k}_{d})\) is a conditional vertical partition state.

Proposition 6.2

Let \(\mathcal {V}\mathcal {S}_{\mathcal {F}^{k}\mathcal {H}^{k}}=(\mathcal {H}^{k}_{1},\ldots ,\mathcal {H}^{k}_{d}; \mathcal {F}^{k}_{1},\ldots ,\mathcal {F}^{k}_{d})\) be a conditional vertical partition state. For any \(m \in M_{j_{1}}\ (j_{1} \in \{1,2,\ldots ,d\})\), if there exist attributes \(m_{1},m_{2},\ldots ,m_{r} \in M_{j_{2}}\ (j_{2} \in \{1,2,\ldots ,d\})\) such that \(\mathcal {H}^{k}_{j_{1}}(m) \subseteq \mathcal {H}^{k}_{j_{2}}(m_{i})\ (i=1,2,\ldots ,r)\), we have

$$\displaystyle \begin{aligned} (\mathcal{H}^{k}(m),\mathcal{F}^{k}\mathcal{H}^{k}(m))=\Big(\mathcal{H}^{k}(m),\big\{m\cup (\mathop{\cup}\limits_{i=1}^{r} m_{i})\big\}\Big); \end{aligned} $$
(6.17)

otherwise,

$$\displaystyle \begin{aligned} (\mathcal{H}^{k}(m),\mathcal{F}^{k}\mathcal{H}^{k}(m))=(\mathcal{H}^{k}(m),\{m\}). \end{aligned} $$
(6.18)

Proof

The proof of Proposition 6.2 can also be found in the original paper [60]. □

From Definition 6.15 and Proposition 6.2, we know that the separability holds for C3LM in the conditional formal context (G, M, I) from the attribute perspective. Similarly, under decision cognitive operators \(\widetilde {\mathcal {F}}^{k}\) and \(\widetilde {\mathcal {H}}^{k}\), there exists the same property for C3LM in the decision formal context (G, D, J).

Algorithm 6.7 Concurrent computation of initial concept space

Based on the above theory, we present an initial concurrent computing framework (see Fig. 6.2 for details) and its corresponding algorithm (see Algorithm 6.7) for constructing the initial concepts. The overall process in Fig. 6.2 can be described as follows: first, a task is divided into many subtasks by recursion, based on Definitions 6.14 and 6.15 and Propositions 6.1 and 6.2. Second, according to Propositions 6.1 and 6.2, threads concurrently calculate the concepts of each subtask. Finally, the results of the different threads are collected by Propositions 6.1 and 6.2. Moreover, the right side of Fig. 6.2 illustrates four threads calculating granular concepts over the object and attribute sets. It should be pointed out that the proposed C3LM is based on the fork/join framework.
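The fork/join idea can be sketched with Python's thread pool standing in for the actual fork/join framework (a simplified stand-in; the block splitting, names, and toy data are illustrative, not the authors' implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def atomic_concepts(block, context):
    """Compute the granular concept (HF(g), F(g)) for each object in one
    partition block of the object set."""
    out = set()
    for g in block:
        B = frozenset(context[g])                          # intent F(g)
        A = frozenset(h for h, ms in context.items() if B <= ms)  # extent HF(g)
        out.add((A, B))
    return out

def concurrent_concepts(context, n_workers=2):
    """Fork: split the objects into blocks (Definition 6.14); compute
    concepts per block in parallel; join: merge the partial results,
    which is sound by Propositions 6.1 and 6.2."""
    objs = list(context)
    blocks = [objs[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda b: atomic_concepts(b, context), blocks)
    merged = set()
    for part in parts:
        merged |= part
    return merged
```

Because each block only reads the shared context, the join step is a plain set union, mirroring the collection step in Fig. 6.2.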

Fig. 6.2

Framework of constructing initial concepts in C3LM

Furthermore, it is easy to determine that the time complexity of Algorithm 6.7 is \(O(\frac {1}{n}|G^{k}|+\frac {1}{d}|M|+\frac {1}{l}|D|)\). For an object set G, by means of Algorithm 6.7, we can obtain the conditional concept space \(\mathcal {G}_{\mathcal {F}\mathcal {H}}^{C}\) and decision concept space \(\mathcal {G}_{\widetilde {\mathcal {F}}\widetilde {\mathcal {H}}}^{D}\).

1.2.2 Concurrent Concept-Cognitive Process in C3LM

In the real world, objects are updated as time passes, which means that the obtained concept spaces need to be updated accordingly. For a person, learning is not simply a matter of acquiring a description, but involves taking something new and integrating it sufficiently with existing thought processes [41]. Human learning ability is a gradual cognitive process. Therefore, in this subsection, we explore the concept-cognitive process in a concurrent environment.

As with the classical cognitive process [36], combining Definitions 6.6 and 6.13, we obtain the cognitive operators for C3LM with the newly input objects \(\Delta G^{k}_{i-1}=G^{k}_{i}-G^{k}_{i-1}\), as follows:

$$\displaystyle \begin{aligned} &\mathrm{(i)} \qquad \, \mathcal{F}^{k}_{i-1}:2^{G^{k}_{i-1}}\rightarrow 2^{M}, \qquad \quad \, \mathcal{H}^{k}_{i-1}:2^{M}\rightarrow 2^{G^{k}_{i-1}}, \\ &\mathrm{(ii)} \qquad\mathcal{F}^{k}_{\Delta G^{k}_{i-1}}:2^{\Delta G^{k}_{i-1}}\rightarrow 2^{M}, \qquad \!\mathcal{H}^{k}_{\Delta G^{k}_{i-1}}:2^{M}\rightarrow 2^{\Delta G^{k}_{i-1}}, \\ &\mathrm{(iii)} \qquad \! \mathcal{F}^{k}_{i}:2^{G^{k}_{i}}\rightarrow 2^{M}, \qquad \qquad \! \mathcal{H}^{k}_{i}:2^{M}\rightarrow 2^{G^{k}_{i}}, \end{aligned} $$
(6.19)

and

$$\displaystyle \begin{aligned} &\mathrm{(iv)} \qquad \! \widetilde{\mathcal{F}}^{k}_{i-1}:2^{G^{k}_{i-1}}\rightarrow 2^{D}, \qquad \quad \, \widetilde{\mathcal{H}}^{k}_{i-1}:2^{D}\rightarrow 2^{G^{k}_{i-1}}, \\ &\mathrm{(v)} \qquad \, \widetilde{\mathcal{F}}^{k}_{\Delta G^{k}_{i-1}}:2^{\Delta G^{k}_{i-1}}\rightarrow 2^{D}, \qquad \!\widetilde{\mathcal{H}}^{k}_{\Delta G^{k}_{i-1}}:2^{D}\rightarrow 2^{\Delta G^{k}_{i-1}}, \\ &\mathrm{(vi)} \qquad\widetilde{\mathcal{F}}^{k}_{i}:2^{G^{k}_{i}}\rightarrow 2^{D}, \qquad \qquad \! \widetilde{\mathcal{H}}^{k}_{i}:2^{D}\rightarrow 2^{G^{k}_{i}}. \end{aligned} $$
(6.20)

Definition 6.16

Let \(\Delta G^{k}_{i-1}=G^{k}_{i}-G^{k}_{i-1}\) be a singleton set with a new object, and \(\mathcal {F}^{k}_{\Delta G^{k}_{i-1}}\),\(\mathcal {H}^{k}_{\Delta G^{k}_{i-1}}\) and \(\widetilde {\mathcal {F}}^{k}_{\Delta G^{k}_{i-1}}\),\(\widetilde {\mathcal {H}}^{k}_{\Delta G^{k}_{i-1}}\) be cognitive operators. For any \(g \in \Delta G^{k}_{i-1}\), if \(\big (\mathcal {H}^{k}_{\Delta G^{k}_{i-1}}\mathcal {F}^{k}_{\Delta G^{k}_{i-1}}(g),\mathcal {F}^{k}_{\Delta G^{k}_{i-1}}(g)\big ) = \big (\{g\},\mathcal {F}^{k}_{\Delta G^{k}_{i-1}}(g)\big )\) and \(\big (\widetilde {\mathcal {H}}^{k}_{\Delta G^{k}_{i-1}}\widetilde {\mathcal {F}}^{k}_{\Delta G^{k}_{i-1}}(g),\widetilde {\mathcal {F}}^{k}_{\Delta G^{k}_{i-1}}(g)\big ) \) \( = \big (\{g\},\widetilde {\mathcal {F}}^{k}_{\Delta G^{k}_{i-1}}(g)\big )\), \(\big (\mathcal {H}^{k}_{\Delta G^{k}_{i-1}}\mathcal {F}^{k}_{\Delta G^{k}_{i-1}}(g), \mathcal {F}^{k}_{\Delta G^{k}_{i-1}}(g)\big )\) and \(\big (\widetilde {\mathcal {H}}^{k}_{\Delta G^{k}_{i-1}}\widetilde {\mathcal {F}}^{k}_{\Delta G^{k}_{i-1}}(g),\widetilde {\mathcal {F}}^{k}_{\Delta G^{k}_{i-1}}(g)\big )\) are referred to as the newly formed conditional atomic concept and decision atomic concept, respectively, with a single object g.

In fact, we consider that the obtained concept spaces \(\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}^{C}\) and \(\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}^{D}\) are updated by a newly input object, rather than adding multiple objects simultaneously. For the sake of convenience, we denote the initial concept spaces obtained by Algorithm 6.7, namely \(\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}^{C}\),\(\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}^{D}\) and \(\mathcal {G}_{\mathcal {F}\mathcal {H}}^{C}\),\(\mathcal {G}_{\widetilde {\mathcal {F}}\widetilde {\mathcal {H}}}^{D}\), as \(\mathcal {G}_{\mathcal {F}^{k}_{0}\mathcal {H}^{k}_{0}}^{C}\),\(\mathcal {G}_{\widetilde {\mathcal {F}}^{k}_{0}\widetilde {\mathcal {H}}^{k}_{0}}^{D}\) and \(\mathcal {G}_{\mathcal {F}_{0}\mathcal {H}_{0}}^{C}\),\(\mathcal {G}_{\widetilde {\mathcal {F}}_{0}\widetilde {\mathcal {H}}_{0}}^{D}\), respectively. According to Eqs. (6.19) and (6.20), the cognitive operators \(\mathcal {F}^{k}_{i}\),\(\mathcal {H}^{k}_{i}\) and \(\widetilde {\mathcal {F}}^{k}_{i}\),\(\widetilde {\mathcal {H}}^{k}_{i}\) in the i-th period can be obtained by the cognitive operators \(\mathcal {F}^{k}_{i-1}\),\(\mathcal {H}^{k}_{i-1}\) and \(\widetilde {\mathcal {F}}^{k}_{i-1}\),\(\widetilde {\mathcal {H}}^{k}_{i-1}\) in the (i − 1)-th period with incremental objects, respectively. Moreover, we denote their corresponding concept spaces by \(\mathcal {G}_{\mathcal {F}^{k}_{i}\mathcal {H}^{k}_{i}}^{C}\) and \(\mathcal {G}_{\widetilde {\mathcal {F}}^{k}_{i}\widetilde {\mathcal {H}}^{k}_{i}}^{D}\). Furthermore, the entire concept spaces in the i-th period are further denoted by \(\mathcal {G}_{\mathcal {F}_{i}\mathcal {H}_{i}}^{C}\) and \(\mathcal {G}_{\widetilde {\mathcal {F}}_{i}\widetilde {\mathcal {H}}_{i}}^{D}\).

Proposition 6.3

Let \(\big (\mathcal {H}^{k}_{\Delta G^{k}_{i-1}}\mathcal {F}^{k}_{\Delta G^{k}_{i-1}}(g),\mathcal {F}^{k}_{\Delta G^{k}_{i-1}}(g)\big )\) and \(\big (\widetilde {\mathcal {H}}^{k}_{\Delta G^{k}_{i-1}} \widetilde {\mathcal {F}}^{k}_{\Delta G^{k}_{i-1}}(g),\widetilde {\mathcal {F}}^{k}_{\Delta G^{k}_{i-1}}(g)\big )\) be the newly formed conditional and decision atomic concepts, respectively. Then, the following statements hold:

$$\displaystyle \begin{aligned} &\mathit{\mbox{ (i) For any granular concept}}\ (A_{k,j},B_{k,j}) \in \mathcal{G}_{\mathcal{F}^{k}_{i-1}\mathcal{H}^{k}_{i-1}}^{C} (j\in \{1,2,\ldots,|\mathcal{G}_{\mathcal{F}^{k}_{i-1}\mathcal{H}^{k}_{i-1}}^{C}|\}), \mathit{\mbox{if}}\\ &B_{k,j}\!\cap \mathcal{F}^{k}_{\Delta G^{k}_{i-1}}(g) \neq \emptyset, (A_{k,j},B_{k,j})=\big(A_{k,j}\! \cup \mathcal{H}^{k}_{\Delta G^{k}_{i-1}}\mathcal{F}^{k}_{\Delta G^{k}_{i-1}}(g),B_{k,j}\! \cap \mathcal{F}^{k}_{\Delta G^{k}_{i-1}}(g) \big);\\ &\mathit{\mbox{otherwise,}}\\ & \mathcal{G}_{\mathcal{F}^{k}_{i}\mathcal{H}^{k}_{i}}^{C}=\mathcal{G}_{\mathcal{F}^{k}_{i-1}\mathcal{H}^{k}_{i-1}}^{C} \cup \big(\mathcal{H}^{k}_{\Delta G^{k}_{i-1}}\mathcal{F}^{k}_{\Delta G^{k}_{i-1}}(g),\mathcal{F}^{k}_{\Delta G^{k}_{i-1}}(g)\big).\\ &\mathit{\mbox{ (ii) For any granular concept}}\ (Y_{k,j},Z_{k,j}) \in \mathcal{G}_{\widetilde{\mathcal{F}}^{k}_{i-1}\widetilde{\mathcal{H}}^{k}_{i-1}}^{D} (j\in \{1,2,\ldots,|\mathcal{G}_{\widetilde{\mathcal{F}}^{k}_{i-1}\widetilde{\mathcal{H}}^{k}_{i-1}}^{D}|\}), \mathit{\mbox{if}}\\ &Z_{k,j} \cap \widetilde{\mathcal{F}}^{k}_{\Delta G^{k}_{i-1}}(g) \neq \emptyset, (Y_{k,j},Z_{k,j}) =\big(Y_{k,j} \cup \widetilde{\mathcal{H}}^{k}_{\Delta G^{k}_{i-1}}\widetilde{\mathcal{F}}^{k}_{\Delta G^{k}_{i-1}}(g),Z_{k,j} \cap \widetilde{\mathcal{F}}^{k}_{\Delta G^{k}_{i-1}}(g) \big);\\ &\mathit{\mbox{otherwise,}}\\ &\mathcal{G}_{\widetilde{\mathcal{F}}^{k}_{i}\widetilde{\mathcal{H}}^{k}_{i}}^{D}=\mathcal{G}_{\widetilde{\mathcal{F}}^{k}_{i-1}\widetilde{\mathcal{H}}^{k}_{i-1}}^{D} \cup \big(\widetilde{\mathcal{H}}^{k}_{\Delta G^{k}_{i-1}}\widetilde{\mathcal{F}}^{k}_{\Delta G^{k}_{i-1}}(g),\widetilde{\mathcal{F}}^{k}_{\Delta G^{k}_{i-1}}(g)\big). \end{aligned} $$

Proof

The proof of Proposition 6.3 can be found in the original paper [60]. □
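Proposition 6.3's update rule can be sketched as follows, under a simplified set-based reading in which a concept space is a list of (extent, intent) pairs (all names here are illustrative):

```python
def update_concept_space(space, g, Fg):
    """Update a concept space with one new object g whose intent is Fg
    (Proposition 6.3): every concept whose intent meets Fg gains g in
    its extent and keeps only the shared attributes; if no intent meets
    Fg, the atomic concept ({g}, Fg) is appended instead."""
    Fg = frozenset(Fg)
    updated, hit = [], False
    for A, B in space:
        if B & Fg:                       # intents overlap: widen extent,
            updated.append((A | {g}, B & Fg))  # narrow intent
            hit = True
        else:
            updated.append((A, B))
    if not hit:                          # otherwise: add the atomic concept
        updated.append((frozenset({g}), Fg))
    return updated
```

The same rule applies verbatim to the decision concept space, with \(\widetilde {\mathcal {F}}\) in place of \(\mathcal {F}\).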

For any object g, according to Definitions 6.5 and 6.6, we can obtain a concept \(\big (\{g\},\mathcal {F}_{\Delta G_{i-1}}(g)\big )\), as the concept spaces are updated by adding objects sequentially. The concept similarity (CS) degree [58] is used in this study to explore the interaction of attributes in the concept-cognitive process.

Definition 6.17 ([58])

Suppose that \(\big (\{g\},\mathcal {F}_{\Delta G_{i-1}}(g)\big )\) is a new concept. For any \((A_{k,j},B_{k,j}) \in \mathcal {G}_{\mathcal {F}^{k}_{i-1},\mathcal {H}^{k}_{i-1}}^C (j \in \{ 1,2,\ldots ,|\mathcal {G}_{\mathcal {F}^{k}_{i-1},\mathcal {H}^{k}_{i-1}}^C|\})\), the CS degree can be defined as follows:

$$\displaystyle \begin{aligned} \theta_{k,j}=\frac{W\cdot M^{T}}{\big|\mathcal{F}_{\Delta G_{i-1}}(g) \bigcup B_{k,j}\big|}, \end{aligned} $$
(6.21)

where \(W = (w_{1}, w_{2}, \ldots , w_{m})\) is a cognitive weight vector regarding a conditional attribute set M, and \(M = (m_{1}, m_{2}, \ldots , m_{m})\) is an attribute vector in which (1) the entries for attributes from \(\mathcal {F}_{\Delta G_{i-1}}(g) \cap B_{k,j}\) are set to 1, and (2) the entries for elements from \(M-(\mathcal {F}_{\Delta G_{i-1}}(g) \cap B_{k,j})\) are all set to 0.

For any object, there always exists a unique class that is most similar to it by the sample separation axiom [79]. Thus, based on Definition 6.17, we can determine the maximum CS degree \(\theta _{k,j^{*}}^{*}=\max\limits _{j\in \{1,2,\ldots ,|\mathcal {G}_{\mathcal {F}^{k}_{i-1},\mathcal {H}^{k}_{i-1}}^C|\}}\{\theta _{k,j}\}\) and its corresponding concept \((A_{k,j^{*}},B_{k,j^{*}})\) in the concept space \(\mathcal {G}_{\mathcal {F}^{k}_{i-1},\mathcal {H}^{k}_{i-1}}^C\). Moreover, for the entire concept space \(\mathcal {G}_{\mathcal {F}_{i-1},\mathcal {H}_{i-1}}^C\), we can further determine the global maximum CS degree \(\theta _{k^{*},j^{*}}^{*}=\max\limits _{k\in \{1,2,\ldots ,K\}}\{\theta _{k,j^{*}}^{*}\}\) and its corresponding concept \((A_{k^{*},j^{*}},B_{k^{*},j^{*}})\).
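Equation (6.21) and the subsequent argmax can be sketched directly; the weights and intents below are toy values:

```python
def cs_degree(Fg, B, w):
    """CS degree of Eq. (6.21): sum of cognitive weights over the shared
    attributes, normalised by the size of the union of the two intents."""
    return sum(w[m] for m in Fg & B) / len(Fg | B)

w = {"m1": 0.5, "m2": 0.3, "m3": 0.2}
concepts = [({"g1"}, {"m1", "m2"}), ({"g2"}, {"m2", "m3"})]

# Maximum CS degree and its corresponding (optimal) concept.
Fg = {"m1", "m2"}
best = max(concepts, key=lambda c: cs_degree(Fg, c[1], w))
```

Here the first concept wins: its CS degree is (0.5 + 0.3) / 2 = 0.4, against 0.3 / 3 = 0.1 for the second.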

Definition 6.18

If \(\theta _{k^{*},j^{*}}^{*}\) is the global maximum CS degree in the entire concept space \(\mathcal {G}_{\mathcal {F}_{i-1},\mathcal {H}_{i-1}}^C\), we say that a new concept \(\big (\{g\},\mathcal {F}_{\Delta G_{i-1}}(g)\big )\) can be classified into the concept space \(\mathcal {G}_{\mathcal {F}^{k^{*}}_{i-1},\mathcal {H}^{k^{*}}_{i-1}}^C\) by the optimal concept \((A_{k^{*},j^{*}},B_{k^{*},j^{*}})\). Moreover, for any \((Y_{k},Z_{k}) \in \mathcal {G}_{\widetilde {\mathcal {F}}_{i-1},\widetilde {\mathcal {H}}_{i-1}}^D (k \in \{ 1,2,\ldots ,|\mathcal {G}_{\widetilde {\mathcal {F}}_{i-1},\widetilde {\mathcal {H}}_{i-1}}^D|\})\), if \(A_{k^{*},j^{*}} \subseteq Y_{k}\), we say that the object g is associated with a single label z, where Z k = {z} in a regular formal decision context.

From Definition 6.18, we can determine that an object g is associated with a class label z if and only if the real class label \(\widetilde {\mathcal {F}}_{\Delta G_{i-1}}(g)\) is consistent with the predicted class label z. However, when the ground truth label is not the same as the predicted value, we adjust the cognitive weight as follows:

$$\displaystyle \begin{aligned} &w_{i}\leftarrow w_{i} \pm \Delta w_{i}, \\ &\Delta w_{i}=activationFunction(\eta w_{i}), \end{aligned} $$
(6.22)

where the operator +  is adopted when the attributes are from \(\mathcal {F}_{\Delta G_{i-1}}(g) \cap B_{k^{*},j^{*}}\), and the other operator − is used for the elements from \(\mathcal {F}_{\Delta G_{i-1}}(g) \cap B_{k,j^{*}}\). Moreover, \(activationFunction(\eta w_{i})=\frac {e^{\eta w_{i}}-e^{-\eta w_{i}}}{e^{\eta w_{i}}+e^{-\eta w_{i}}}\), where η ∈ (0, 1) is known as the learning rate.
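Since the activation function in Eq. (6.22) is exactly \(\tanh (\eta w_{i})\), the adjustment can be sketched as follows (the attribute names and η value are illustrative):

```python
import math

def adjust_weights(w, promote, demote, eta=0.2):
    """Weight adjustment of Eq. (6.22): the increment tanh(eta * w_i) is
    added for attributes supporting the true class (promote) and
    subtracted for attributes supporting the wrong prediction (demote)."""
    out = dict(w)
    for m in promote:
        out[m] += math.tanh(eta * out[m])
    for m in demote:
        out[m] -= math.tanh(eta * out[m])
    return out
```

Because tanh is bounded, each update moves a weight by at most 1, keeping the adjustment stable across epochs.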

In the following, a computational procedure for the concurrent concept-cognitive process (see Algorithm 6.8) is proposed based on the above discussion. The inputs of Algorithm 6.8 are the concept spaces output by Algorithm 6.7. In Algorithm 6.8, running Steps 9 and 12 requires \(O\big (|\mathcal {G}_{\mathcal {F}^{k}_{i-1}\mathcal {H}^{k}_{i-1}}^{C}|\big )\) and \(O\big (|\mathcal {G}_{\widetilde {\mathcal {F}}^{k}_{i-1}\widetilde {\mathcal {H}}^{k}_{i-1}}^{D}|\big )\), respectively, and running Step 15 requires \(O\big (|\mathcal {G}_{\mathcal {F}^{k}_{i-1}\mathcal {H}^{k}_{i-1}}^{C}|\big )\). Hence, it is easy to determine that the time complexity of Algorithm 6.8 is \(O\big (n(\frac {1}{m} |\mathcal {G}_{\mathcal {F}^{k}_{i-1}\mathcal {H}^{k}_{i-1}}^{C}| + \frac {1}{p}|\mathcal {G}_{\widetilde {\mathcal {F}}^{k}_{i-1}\widetilde {\mathcal {H}}^{k}_{i-1}}^{D}|+ |\mathcal {G}_{\widetilde {\mathcal {F}}_{i-1},\widetilde {\mathcal {H}}_{i-1}}^D|)\big )\). We then obtain the collections of all conditional and decision concepts in the final period, denoted by \(\mathcal {G}_{\mathcal {F}_{n},\mathcal {H}_{n}}^C\) and \(\mathcal {G}_{\widetilde {\mathcal {F}}_{n},\widetilde {\mathcal {H}}_{n}}^D\), respectively.

Algorithm 6.8 Concurrent concept-cognitive process

1.2.3 Concept Generalization Process in C3LM

Based on the final concept spaces obtained, the model achieves classification ability. This can be understood in terms of two aspects: (1) when the final concept spaces are obtained directly from the initial concept learning, it completes the static classification task, and (2) when the initial concept construction is combined with the CCL process, it is suitable for the dynamic classification task. In both cases, label information is predicted by means of the CS degree.

For a test instance g, let ΔG i−1 = {g}, and we obtain a new concept \(\big (\mathcal {H}_{\Delta G_{i-1}}\mathcal {F}_{\Delta G_{i-1}}\ (g),\mathcal {F}_{\Delta G_{i-1}}(g)\big ) = \big (\{g\},\mathcal {F}_{\Delta G_{i-1}}(g)\big )\) by Definitions 6.5 and 6.16. Furthermore, according to Definitions 6.17 and 6.18, a procedure is proposed for the concept generalization task (see Algorithm 6.9). It is easy to determine that the time complexity of Algorithm 6.9 is \(O\big (|\overline {G}|(| \mathcal {G}_{\mathcal {F}_{n},\mathcal {H}_{n}}^C | + | \mathcal {G}_{\widetilde {\mathcal {F}}_{n}, \widetilde {\mathcal {H}}_{n}}^D |)\big )\).
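The prediction step of Definitions 6.17 and 6.18 reduces to scoring the test instance's intent against every stored conditional concept, per class, and returning the label of the globally most similar one. A minimal sketch (toy data; names hypothetical):

```python
def predict(Fg, spaces, w):
    """Return the class label whose concept space contains the concept
    with the global maximum CS degree for the test intent Fg."""
    def theta(B):
        return sum(w[m] for m in Fg & B) / len(Fg | B)
    best_label, best_theta = None, -1.0
    for label, concepts in spaces.items():
        for _, B in concepts:
            t = theta(B)
            if t > best_theta:
                best_label, best_theta = label, t
    return best_label

# Toy per-class conditional concept spaces and uniform weights.
spaces = {"d1": [({"g1"}, {"m1", "m2"})],
          "d2": [({"g2"}, {"m3"})]}
w = {"m1": 1.0, "m2": 1.0, "m3": 1.0}
```

Unlike the concept-cognitive stage, this pass only reads the concept spaces; it never updates them.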

Algorithm 6.9 Generalization process

1.3 Semi-Supervised Concept Learning by Concept-Cognitive Learning and Conceptual Clustering

In this subsection, we first introduce the initial concept spaces built from labeled data, then the concept-cognitive process with unlabeled data, followed by the concept recognition and theoretical analysis of S2CL. Finally, we present the whole procedure and the computational cost of our methods.

1.3.1 Concept Space with Structural Information

Definition 6.19

Suppose \(G^{k}\) is a sub-object set which is associated with a label k, and a quintuple \((G^{k}, M, I, D, J)\) is known as a regular sub-object formal decision context. Then \((G^{k}, M, I)\) and \((G^{k}, D, J)\) are respectively called the conditional sub-object formal context and the decision sub-object formal context.

Moreover, the set-valued mappings \(\mathcal {F}^{k}:2^{G^{k}}\rightarrow 2^{M}, \mathcal {H}^{k}:2^{M}\rightarrow 2^{G^{k}}\), and \(\widetilde {\mathcal {F}}^{k}:2^{G^{k}}\rightarrow 2^{D}, \widetilde {\mathcal {H}}^{k}:2^{D}\rightarrow 2^{G^{k}}\) are respectively called the conditional sub-object cognitive operators and decision sub-object cognitive operators with the sub-object set \(G^{k}\).

Definition 6.20

Let (G k, M, I) be a conditional sub-object formal context, and \(\mathcal {F}^{k}\), \(\mathcal {H}^{k}\) be the conditional sub-object cognitive operators. For any x′, x″ ∈ G k, if \(\mathcal {H}^{k}\mathcal {F}^{k}(x')=\{x'\}\) and \(\mathcal {H}^{k}\mathcal {F}^{k}(x'') \supset \{x''\}\), then the pairs \((\mathcal {H}^{k}\mathcal {F}^{k}(x'),\mathcal {F}^{k}(x'))\) and \((\mathcal {H}^{k}\mathcal {F}^{k}(x''),\mathcal {F}^{k}(x''))\) are referred to as object-oriented conditional granular concepts (or simply object-oriented conditional concepts). For convenience, we denote

$$\displaystyle \begin{aligned} \mathcal{O}\mathcal{G}_{\mathcal{F}^{k}\mathcal{H}^{k}}= &\{(\mathcal{H}^{k}\mathcal{F}^{k}(x'),\mathcal{F}^{k}(x'))|x'\in G^{k}\} \cup \\ &\{(\mathcal{H}^{k}\mathcal{F}^{k}(x''),\mathcal{F}^{k}(x''))|x''\in G^{k}\}. \end{aligned} $$

Simultaneously, for any a′, a″ ∈ M, if \(\mathcal {F}^{k}\mathcal {H}^{k}(a')=\{a'\}\) and \(\mathcal {F}^{k}\mathcal {H}^{k}(a'') \supset \{a''\}\), then the pairs \((\mathcal {H}^{k}(a'),\mathcal {F}^{k}\mathcal {H}^{k}(a'))\) and \((\mathcal {H}^{k}(a''),\mathcal {F}^{k}\mathcal {H}^{k}(a''))\) are called attribute-oriented conditional granular concepts (or simply attribute-oriented conditional concepts). For brevity, we further denote

$$\displaystyle \begin{aligned} \mathcal{A}\mathcal{G}_{\mathcal{F}^{k}\mathcal{H}^{k}}= &\{(\mathcal{H}^{k}(a'),\mathcal{F}^{k}\mathcal{H}^{k}(a'))|a'\in M\} \cup \\ &\{(\mathcal{H}^{k}(a''),\mathcal{F}^{k}\mathcal{H}^{k}(a''))|a''\in M\}. \end{aligned} $$

Definition 6.21

Let (G k, D, J) be a decision sub-object formal context and \(\widetilde {\mathcal {F}}^{k}\), \(\widetilde {\mathcal {H}}^{k}\) be the decision sub-object cognitive operators. For any x′, x″ ∈ G k, if \(\widetilde {\mathcal {H}}^{k}\widetilde {\mathcal {F}}^{k}(x')=\{x'\}\) and \(\widetilde {\mathcal {H}}^{k}\widetilde {\mathcal {F}}^{k}(x'') \supset \{x''\}\), then the pairs \((\widetilde {\mathcal {H}}^{k}\widetilde {\mathcal {F}}^{k}(x'),\widetilde {\mathcal {F}}^{k}(x'))\) and \((\widetilde {\mathcal {H}}^{k}\widetilde {\mathcal {F}}^{k}(x''),\widetilde {\mathcal {F}}^{k}(x''))\) are known as object-oriented decision granular concepts (or simply object-oriented decision concepts). For convenience, we denote

$$\displaystyle \begin{aligned} \mathcal{O}\mathcal{G}_{\widetilde{\mathcal{F}}^{k}\widetilde{\mathcal{H}}^{k}}= &\{(\widetilde{\mathcal{H}}^{k}\widetilde{\mathcal{F}}^{k}(x'),\widetilde{\mathcal{F}}^{k}(x'))|x'\in G^{k}\} \cup \\ &\{(\widetilde{\mathcal{H}}^{k}\widetilde{\mathcal{F}}^{k}(x''),\widetilde{\mathcal{F}}^{k}(x''))|x''\in G^{k}\}. \end{aligned} $$

Meanwhile, for any k′, k″ ∈ D, if \(\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}(k')=\{k'\}\) and \(\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}(k'') \supset \{k''\}\), then the pairs \((\widetilde {\mathcal {H}}^{k}(k'),\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}(k'))\) and \((\widetilde {\mathcal {H}}^{k}(k''),\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}(k''))\) are called attribute-oriented decision granular concepts (or simply attribute-oriented decision concepts). For brevity, we further denote

$$\displaystyle \begin{aligned} \mathcal{A}\mathcal{G}_{\widetilde{\mathcal{F}}^{k}\widetilde{\mathcal{H}}^{k}}= &\{(\widetilde{\mathcal{H}}^{k}(k'),\widetilde{\mathcal{F}}^{k}\widetilde{\mathcal{H}}^{k}(k'))|k'\in D\} \cup \\ &\{(\widetilde{\mathcal{H}}^{k}(k''),\widetilde{\mathcal{F}}^{k}\widetilde{\mathcal{H}}^{k}(k''))|k''\in D\}. \end{aligned} $$

To facilitate the subsequent discussion, in a regular sub-object formal decision context, the conditional concept space and decision concept space are respectively denoted by

$$\displaystyle \begin{aligned} \mathcal{G}_{\mathcal{F}^{k}\mathcal{H}^{k}}&=\mathcal{O}\mathcal{G}_{\mathcal{F}^{k}\mathcal{H}^{k}}\cup \mathcal{A}\mathcal{G}_{\mathcal{F}^{k}\mathcal{H}^{k}} \\ &=\{(\mathcal{H}^{k}\mathcal{F}^{k}(x),\mathcal{F}^{k}(x))|x\in G^{k}\} \cup \\ &\quad \ \ \{(\mathcal{H}^{k}(a),\mathcal{F}^{k}\mathcal{H}^{k}(a))|a\in M\}, \mbox{and} \\ \mathcal{G}_{\widetilde{\mathcal{F}}^{k}\widetilde{\mathcal{H}}^{k}}&=\mathcal{O}\mathcal{G}_{\widetilde{\mathcal{F}}^{k}\widetilde{\mathcal{H}}^{k}}\cup \mathcal{A}\mathcal{G}_{\widetilde{\mathcal{F}}^{k}\widetilde{\mathcal{H}}^{k}} \\ &=\{(\widetilde{\mathcal{H}}^{k}\widetilde{\mathcal{F}}^{k}(x),\widetilde{\mathcal{F}}^{k}(x))|x\in G^{k}\} \cup \\ &\quad \ \ \{(\widetilde{\mathcal{H}}^{k}(k'),\widetilde{\mathcal{F}}^{k}\widetilde{\mathcal{H}}^{k}(k'))|k'\in D\}. \end{aligned} $$

This means that the concept spaces of the sub-object set \(G^{k}\) can be constructed from the object-oriented concepts and the attribute-oriented concepts.

Theorem 6.2

Let \(\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}\) and \(\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}\) be the conditional concept space and decision concept space, respectively. Then the following statements hold:

(1)

    For any conditional concepts \((\mathcal {H}^{k}\mathcal {F}^{k}(x),\mathcal {F}^{k}(x))\) and \((\mathcal {H}^{k}(a),\mathcal {F}^{k}\mathcal {H}^{k}(a))\), if there exists a conditional concept \((\mathcal {H}^{k}\mathcal {F}^{k}(x_{i}),\mathcal {F}^{k}(x_{i})) \in \mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}\) such that \(\mathcal {F}^{k}(x) \subseteq \mathcal {F}^{k}(x_{i})\ (i\in \{1,2,\ldots ,|G^{k}|\})\) and a conditional concept \((\mathcal {H}^{k}(a_{j}),\mathcal {F}^{k}\mathcal {H}^{k}(a_{j})) \in \mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}\) such that \(\mathcal {H}^{k}(a) \subseteq \mathcal {H}^{k}(a_{j})\ (j\in \{1, 2, \ldots ,|M|\})\), then we have

    $$\displaystyle \begin{aligned} & (\mathcal{H}^{k}\mathcal{F}^{k}(x),\mathcal{F}^{k}(x))=(\{x \cup \bigcup\limits_{i \in \{1,2,\ldots,|G^{k}|\}}x_{i}\},\mathcal{F}^{k}(x)),\\ & (\mathcal{H}^{k}(a),\mathcal{F}^{k}\mathcal{H}^{k}(a))=(\mathcal{H}^{k}(a),\{a\cup \bigcup\limits_{j\in \{1,2,\ldots,|M|\}}a_{j}\}); \end{aligned} $$
    (6.23)

    otherwise,

    $$\displaystyle \begin{aligned} &(\mathcal{H}^{k}\mathcal{F}^{k}(x),\mathcal{F}^{k}(x))=(\{x \},\mathcal{F}^{k}(x)),\\ &(\mathcal{H}^{k}(a),\mathcal{F}^{k}\mathcal{H}^{k}(a))=(\mathcal{H}^{k}(a),\{a\}). \end{aligned} $$
    (6.24)
(2)

    For any decision concepts \((\widetilde {\mathcal {H}}^{k}\widetilde {\mathcal {F}}^{k}(x),\widetilde {\mathcal {F}}^{k}(x))\) and \((\widetilde {\mathcal {H}}^{k}(k'),\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}(k'))\), if there exists a decision concept \((\widetilde {\mathcal {H}}^{k}\widetilde {\mathcal {F}}^{k}(x_{i}),\widetilde {\mathcal {F}}^{k}(x_{i})) \in \mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}\) such that \(\widetilde {\mathcal {F}}^{k}(x) \subseteq \widetilde {\mathcal {F}}^{k}(x_{i})\ (i\in \{1, 2, \ldots ,|G^{k}|\})\) and a decision concept \((\widetilde {\mathcal {H}}^{k}(k_{j}),\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}(k_{j})) \in \mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}\) such that \(\widetilde {\mathcal {H}}^{k}(k') \subseteq \widetilde {\mathcal {H}}^{k}(k_{j})\ (j\in \{1,2,\ldots ,|D|\})\), then the following statements hold:

    $$\displaystyle \begin{aligned} & (\widetilde{\mathcal{H}}^{k}\widetilde{\mathcal{F}}^{k}(x),\widetilde{\mathcal{F}}^{k}(x))=(\{x \cup \bigcup\limits_{i \in \{1,2,\ldots,|G^{k}|\}}x_{i}\},\widetilde{\mathcal{F}}^{k}(x)),\\ & (\widetilde{\mathcal{H}}^{k}(k'),\widetilde{\mathcal{F}}^{k}\widetilde{\mathcal{H}}^{k}(k'))=(\widetilde{\mathcal{H}}^{k}(k'),\{k'\cup \\ & \bigcup\limits_{j\in \{1,2,\ldots,|D|\}}k_{j}\}); \end{aligned} $$
    (6.25)

    otherwise,

    $$\displaystyle \begin{aligned} &(\widetilde{\mathcal{H}}^{k}\widetilde{\mathcal{F}}^{k}(x),\widetilde{\mathcal{F}}^{k}(x))=(\{x \},\widetilde{\mathcal{F}}^{k}(x)),\\ &(\widetilde{\mathcal{H}}^{k}(k'),\widetilde{\mathcal{F}}^{k}\widetilde{\mathcal{H}}^{k}(k'))=(\widetilde{\mathcal{H}}^{k}(k'),\{k'\}). \end{aligned} $$
    (6.26)

Proof

The proof of Theorem 6.2 can be found in the original paper [43]. □

Property 6.4

Let \(\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}^{\diamond }\) and \(\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}^{\diamond }\) be two concept spaces, \(\mathcal {A}\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}\) and \(\mathcal {A}\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}\) be the attribute-oriented conditional concept space and attribute-oriented decision concept space, respectively; meanwhile, initialize \(\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}^{\diamond }=\mathcal {A}\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}\) and \(\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}^{\diamond }=\mathcal {A}\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}\). Then we have

  (1)

    For each x ∈ G k, if there exists \((\mathcal {H}^{k}(a),\mathcal {F}^{k}\mathcal {H}^{k}(a))\in \mathcal {A}\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}\) such that \(\mathcal {F}^{k}(x)=\mathcal {F}^{k}\mathcal {H}^{k}(a)\), then \((\mathcal {H}^{k}\mathcal {F}^{k}(x),\mathcal {F}^{k}(x))=(\mathcal {H}^{k}(a),\mathcal {F}^{k}\mathcal {H}^{k}(a))\); otherwise,

    \(\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}^{\diamond } =\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}^{\diamond } \cup (\mathcal {H}^{k}\mathcal {F}^{k}(x),\mathcal {F}^{k}(x))\).

  (2)

    For each x ∈ G k, if there exists \((\widetilde {\mathcal {H}}^{k}(k'),\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}(k'))\in \mathcal {A}\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}\) such that \(\widetilde {\mathcal {F}}^{k}(x)=\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}(k')\), then \((\widetilde {\mathcal {H}}^{k}\widetilde {\mathcal {F}}^{k}(x),\widetilde {\mathcal {F}}^{k}(x))= (\widetilde {\mathcal {H}}^{k}(k'),\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}(k'))\);

    otherwise,

    \(\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}^{\diamond } =\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}^{\diamond } \cup (\widetilde {\mathcal {H}}^{k}\widetilde {\mathcal {F}}^{k}(x),\widetilde {\mathcal {F}}^{k}(x))\).

Proof

The proof of Property 6.4 can be found in the original paper [43]. □

Property 6.4 means that we do not need to construct the concepts \((\mathcal {H}^{k}\mathcal {F}^{k}(x),\mathcal {F}^{k}(x))\) and \((\widetilde {\mathcal {H}}^{k}\widetilde {\mathcal {F}}^{k}(x),\widetilde {\mathcal {F}}^{k}(x))\) as in [58] when \(\mathcal {F}^{k}(x)=\mathcal {F}^{k}\mathcal {H}^{k}(a)\) and \(\widetilde {\mathcal {F}}^{k}(x)=\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}(k')\). Using this approach, we can finally obtain \(\mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}= \mathcal {G}_{\mathcal {F}^{k}\mathcal {H}^{k}}^{\diamond }\) and \(\mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}= \mathcal {G}_{\widetilde {\mathcal {F}}^{k}\widetilde {\mathcal {H}}^{k}}^{\diamond }\).

For convenience, we denote the labeled dataset \(S_{L}\) by \(G_{0}\), and the initial concept spaces by \(\mathcal {G}_{\mathcal {F}_{0}\mathcal {H}_{0}}\) and \(\mathcal {G}_{\widetilde {\mathcal {F}}_{0}\widetilde {\mathcal {H}}_{0}}\). Note that, in the initial concept space period, if the object set \(G^{k}\) is replaced with \(G_{0}^{k}\), then the corresponding cognitive operators \(\mathcal {F}^{k},\mathcal {H}^{k}\) and \(\widetilde {\mathcal {F}}^{k},\widetilde {\mathcal {H}}^{k}\) can be expressed as \(\mathcal {F}_{0}^{k},\mathcal {H}_{0}^{k}\) and \(\widetilde {\mathcal {F}}_{0}^{k},\widetilde {\mathcal {H}}_{0}^{k}\), respectively.

1.3.2 Cognitive Process with Unlabeled Data in Concept Learning

In the concept-cognitive process, suppose the obtained concept spaces are updated by one newly added object at a time rather than by inputting multiple objects simultaneously. Then, for the unlabeled set \(S_{U}\), we can write \(S_{U}\) as \(\Delta G=\{\Delta G_{0},\Delta G_{1},\ldots ,\Delta G_{n-1}\}\), in which each learning step consists of only one object x (i.e., \(\Delta G_{i}=\{x_{i}\}\)). For brevity, in what follows we write \(\{x_{i}\}\) as \(x_{i}\), so that \(\Delta G=\{x_{0},x_{1},\ldots ,x_{n-1}\}\).

Unlike [58], since no label information is available, we associate an object x with a virtual label \(k^{*}\). Then we have the conditional sub-object cognitive operators and decision sub-object cognitive operators with the newly input data \(\Delta G^{k^{*}}_{i-1}=G^{k^{*}}_{i}-G^{k^{*}}_{i-1}\) as follows:

$$\displaystyle \begin{aligned} &\mathrm{(i)} \ \ \mathcal{F}^{k^{*}}_{i-1}:2^{G^{k^{*}}_{i-1}}\rightarrow 2^{M}, \qquad \ \mathcal{H}^{k^{*}}_{i-1}:2^{M}\rightarrow 2^{G^{k^{*}}_{i-1}},\\ &\mathrm{(ii)} \ \mathcal{F}^{k^{*}}_{\Delta G^{k^{*}}_{i-1}}:2^{\Delta G^{k^{*}}_{i-1}}\rightarrow 2^{M}, \ \mathcal{H}^{k^{*}}_{\Delta G^{k^{*}}_{i-1}}:2^{M}\rightarrow 2^{\Delta G^{k^{*}}_{i-1}},\\ &\mathrm{(iii)} \ \! \mathcal{F}^{k^{*}}_{i}:2^{G^{k^{*}}_{i}}\rightarrow 2^{M}, \quad \qquad \! \mathcal{H}^{k^{*}}_{i}:2^{M}\rightarrow 2^{G^{k^{*}}_{i}},\\ \end{aligned} $$
(6.27)

and

$$\displaystyle \begin{aligned} &\mathrm{(i)} \ \ \widetilde{\mathcal{F}}^{k^{*}}_{i-1}:2^{G^{k^{*}}_{i-1}}\rightarrow 2^{D}, \qquad \ \widetilde{\mathcal{H}}^{k^{*}}_{i-1}:2^{D}\rightarrow 2^{G^{k^{*}}_{i-1}},\\ &\mathrm{(ii)} \ \widetilde{\mathcal{F}}^{k^{*}}_{\Delta G^{k^{*}}_{i-1}}:2^{\Delta G^{k^{*}}_{i-1}}\rightarrow 2^{D}, \ \widetilde{\mathcal{H}}^{k^{*}}_{\Delta G^{k^{*}}_{i-1}}:2^{D}\rightarrow 2^{\Delta G^{k^{*}}_{i-1}},\\ &\mathrm{(iii)} \ \! \widetilde{\mathcal{F}}^{k^{*}}_{i}:2^{G^{k^{*}}_{i}}\rightarrow 2^{D}, \quad \qquad \! \widetilde{\mathcal{H}}^{k^{*}}_{i}:2^{D}\rightarrow 2^{G^{k^{*}}_{i}}. \end{aligned} $$
(6.28)

Theorem 6.3

Let \(\mathcal {A}\mathcal {G}_{\mathcal {F}_{i-1}^{k^{*}}\mathcal {H}_{i-1}^{k^{*}}}\), \(\mathcal {A}\mathcal {G}_{\widetilde {\mathcal {F}}_{i-1}^{k^{*}}\widetilde {\mathcal {H}}_{i-1}^{k^{*}}}\) and \(\mathcal {O}\mathcal {G}_{\mathcal {F}_{i-1}^{k^{*}}\mathcal {H}_{i-1}^{k^{*}}}\), \(\mathcal {O}\mathcal {G}_{\widetilde {\mathcal {F}}_{i-1}^{k^{*}}\widetilde {\mathcal {H}}_{i-1}^{k^{*}}}\) be the attribute-oriented concept spaces and object-oriented concept spaces, respectively. Then we have

  (1)

    For any \(a' \in M\) and \((\mathcal {H}_{i-1}^{k^{*}}(a''),\mathcal {F}_{i-1}^{k^{*}}\mathcal {H}_{i-1}^{k^{*}}(a''))\in \mathcal {A}\mathcal {G}_{\mathcal {F}_{i-1}^{k^{*}}\mathcal {H}_{i-1}^{k^{*}}}\), if \(\mathcal {F}_{i-1}^{k^{*}}\mathcal {H}_{i-1}^{k^{*}}(a'') \cap \mathcal {F}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\mathcal {H}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(a') \neq \emptyset \), then

    \((\mathcal {H}_{i}^{k^{*}}(a''),\mathcal {F}_{i}^{k^{*}}\mathcal {H}_{i}^{k^{*}}(a''))=(\mathcal {H}_{i-1}^{k^{*}}(a'') \cup \mathcal {H}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(a'), \mathcal {F}_{i-1}^{k^{*}}\mathcal {H}_{i-1}^{k^{*}}(a'') \cap \mathcal {F}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\mathcal {H}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(a'))\); otherwise, \((\mathcal {H}_{i}^{k^{*}}(a'),\mathcal {F}_{i}^{k^{*}}\mathcal {H}_{i}^{k^{*}}(a'))=(\mathcal {H}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(a'), \mathcal {F}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\mathcal {H}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(a'))\).

  (2)

    For any \(x' \in \Delta G^{k^{*}}\) and \((\mathcal {H}_{i-1}^{k^{*}}\mathcal {F}_{i-1}^{k^{*}}(x''),\mathcal {F}_{i-1}^{k^{*}}(x'')) \in \mathcal {O}\mathcal {G}_{\mathcal {F}^{k^{*}}_{i-1}\mathcal {H}^{k^{*}}_{i-1}}\), if \(\mathcal {F}_{i-1}^{k^{*}}(x'') \subseteq \mathcal {F}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x')\), then \((\mathcal {H}^{k^{*}}_{i}\mathcal {F}^{k^{*}}_{i}(x''),\mathcal {F}^{k^{*}}_{i}(x''))=(\mathcal {H}_{i-1}^{k^{*}}\mathcal {F}_{i-1}^{k^{*}}(x'') \cup \mathcal {H}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\mathcal {F}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x'),\mathcal {F}^{k^{*}}_{i-1}(x''))\); if \(\mathcal {F}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x') \subseteq \mathcal {F}_{i-1}^{k^{*}}(x'')\), then \((\mathcal {H}^{k^{*}}_{i}\mathcal {F}^{k^{*}}_{i}(x''),\mathcal {F}^{k^{*}}_{i}(x'')) = (\mathcal {H}_{i-1}^{k^{*}}\mathcal {F}_{i-1}^{k^{*}}(x'') \cup \mathcal {H}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\mathcal {F}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x'),\mathcal {F}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x'))\);

    otherwise,

    \((\mathcal {H}^{k^{*}}_{i}\mathcal {F}^{k^{*}}_{i}(x'),\mathcal {F}^{k^{*}}_{i}(x')) = (\mathcal {H}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\mathcal {F}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x'),\mathcal {F}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x'))\).

  (3)

    For any \(k' \in D\) and \((\widetilde {\mathcal {H}}_{i-1}^{k^{*}}(k''),\widetilde {\mathcal {F}}_{i-1}^{k^{*}}\widetilde {\mathcal {H}}_{i-1}^{k^{*}}(k''))\in \mathcal {A}\mathcal {G}_{\widetilde {\mathcal {F}}_{i-1}^{k^{*}}\widetilde {\mathcal {H}}_{i-1}^{k^{*}}}\), if \(\widetilde {\mathcal {F}}_{i-1}^{k^{*}}\widetilde {\mathcal {H}}_{i-1}^{k^{*}}(k'') \cap \widetilde {\mathcal {F}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\widetilde {\mathcal {H}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(k') \neq \emptyset \), then

    \((\widetilde {\mathcal {H}}_{i}^{k^{*}}(k''),\widetilde {\mathcal {F}}_{i}^{k^{*}}\widetilde {\mathcal {H}}_{i}^{k^{*}}(k''))=(\widetilde {\mathcal {H}}_{i-1}^{k^{*}}(k'') \cup \widetilde {\mathcal {H}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(k'),\widetilde {\mathcal {F}}_{i-1}^{k^{*}}\widetilde {\mathcal {H}}_{i-1}^{k^{*}}(k'') \cap \widetilde {\mathcal {F}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\widetilde {\mathcal {H}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(k'))\);

    otherwise,

    \((\widetilde {\mathcal {H}}_{i}^{k^{*}}(k'),\widetilde {\mathcal {F}}_{i}^{k^{*}}\widetilde {\mathcal {H}}_{i}^{k^{*}}(k'))=(\widetilde {\mathcal {H}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(k'),\widetilde {\mathcal {F}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\widetilde {\mathcal {H}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(k'))\).

  (4)

    For any \(x' \in \Delta G^{k^{*}}\) and \((\widetilde {\mathcal {H}}_{i-1}^{k^{*}}\widetilde {\mathcal {F}}_{i-1}^{k^{*}}(x''),\widetilde {\mathcal {F}}_{i-1}^{k^{*}}(x''))\in \mathcal {O}\mathcal {G}_{\widetilde {\mathcal {F}}^{k^{*}}_{i-1}\widetilde {\mathcal {H}}^{k^{*}}_{i-1}}\), if \(\widetilde {\mathcal {F}}_{i-1}^{k^{*}}(x'') \subseteq \widetilde {\mathcal {F}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x')\), then \((\widetilde {\mathcal {H}}^{k^{*}}_{i}\widetilde {\mathcal {F}}^{k^{*}}_{i}(x''),\widetilde {\mathcal {F}}^{k^{*}}_{i}(x'')) = (\widetilde {\mathcal {H}}_{i-1}^{k^{*}}\widetilde {\mathcal {F}}_{i-1}^{k^{*}}(x'') \cup \widetilde {\mathcal {H}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\widetilde {\mathcal {F}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x'), \widetilde {\mathcal {F}}^{k^{*}}_{i-1}(x''))\); if \(\widetilde {\mathcal {F}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x') \subseteq \widetilde {\mathcal {F}}_{i-1}^{k^{*}}(x'')\), then \((\widetilde {\mathcal {H}}^{k^{*}}_{i}\widetilde {\mathcal {F}}^{k^{*}}_{i}(x''),\widetilde {\mathcal {F}}^{k^{*}}_{i}(x'')) = (\widetilde {\mathcal {H}}_{i-1}^{k^{*}}\widetilde {\mathcal {F}}_{i-1}^{k^{*}}(x'') \cup \widetilde {\mathcal {H}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\widetilde {\mathcal {F}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x'),\widetilde {\mathcal {F}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x'))\);

    otherwise,

    \((\widetilde {\mathcal {H}}^{k^{*}}_{i}\widetilde {\mathcal {F}}^{k^{*}}_{i}(x'),\widetilde {\mathcal {F}}^{k^{*}}_{i}(x')) =(\widetilde {\mathcal {H}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}\widetilde {\mathcal {F}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x'),\widetilde {\mathcal {F}}^{k^{*}}_{\Delta G_{i-1}^{k^{*}}}(x'))\).

Proof

The proof of Theorem 6.3 can be found in the original paper [42]. □

Although Theorem 6.3 shows how to update the concept spaces when an instance is added, concept recognition remains exceedingly difficult because we cannot directly observe the real class label of each instance x. That is, unlike the initial concept-space generation, it is unclear which sub-concept space should be updated when a new object arrives without label information.

1.3.3 Concept Recognition

For any newly input object x, the concept \((\mathcal {H}^{k^{*}}_{\Delta G^{k^{*}}_{i}}\mathcal {F}^{k^{*}}_{\Delta G^{k^{*}}_{i}}(x),\mathcal {F}^{k^{*}}_{\Delta G^{k^{*}}_{i}}(x))\) can be rewritten as \((\{x\},\mathcal {F}^{k^{*}}_{\Delta G^{k^{*}}_{i}}(x))\) because \(|\Delta G^{k^{*}}_{i}|=1\). Meanwhile, to cope with large amounts of unlabeled data, a new similarity metric for concept learning is proposed in this subsection. In fact, a good similarity measure for concepts is key to the success of S2CL.

Definition 6.22

Let \(\mathcal {G}_{\mathcal {F}_{i-1},\mathcal {H}_{i-1}}\) be the concept space and \(\mathcal {G}_{\mathcal {F}^{k^{*}}_{i-1},\mathcal {H}^{k^{*}}_{i-1}}\) be a sub-concept space with a virtual label \(k^{*}\) in the (i − 1)-th state. For any concept \((X_{j},B_{j}) \in \mathcal {G}_{\mathcal {F}^{k^{*}}_{i-1},\mathcal {H}^{k^{*}}_{i-1}}\), where \(j \in \{ 1,2,\ldots ,|\mathcal {G}_{\mathcal {F}^{k^{*}}_{i-1},\mathcal {H}^{k^{*}}_{i-1}}|\}\), the global information \(w_{i-1,k^{*}}\) and the local information \(z^{k^{*}}_{i-1,j}\) in the (i − 1)-th state are, respectively, defined as

$$\displaystyle \begin{aligned} &w_{i-1,k^{*}}= \frac{|\mathcal{G}_{\mathcal{F}^{k^{*}}_{i-1},\mathcal{H}^{k^{*}}_{i-1}}|}{|\mathcal{G}_{\mathcal{F}_{i-1},\mathcal{H}_{i-1}}|}, \end{aligned} $$
(6.29)
$$\displaystyle \begin{aligned} &z^{k^{*}}_{i-1,j}= \frac{|X_{j}|}{|\mathcal{G}_{\mathcal{F}^{k^{*}}_{i-1},\mathcal{H}^{k^{*}}_{i-1}}|}. \end{aligned} $$
(6.30)

More generally, considering the entire concept space \(\mathcal {G}_{\mathcal {F}_{i-1},\mathcal {H}_{i-1}}\) in the (i − 1)-th state, we denote

$$\displaystyle \begin{aligned} &\boldsymbol{w}_{i-1}=(w_{i-1,1},w_{i-1,2},\ldots,w_{i-1,l}), \end{aligned} $$
(6.31)
$$\displaystyle \begin{aligned} &\boldsymbol{z}_{i-1}=\left[ \begin{array}{c} \boldsymbol{z}^{1}_{i-1}\\ \vdots \\ \boldsymbol{z}^{l}_{i-1}\\ \end{array} \right]=\left[ \begin{array}{ccc} z^{1}_{i-1,1} &\cdots &z^{1}_{i-1,m_{1}}\\ \vdots &\ddots &\vdots \\ z^{l}_{i-1,1} &\cdots &z^{l}_{i-1,m_{l}}\\ \end{array} \right], \end{aligned} $$
(6.32)

where \(m_{k^{*}}=\big |\mathcal {G}_{\mathcal {F}^{k^{*}}_{i-1},\mathcal {H}^{k^{*}}_{i-1}}\big |\) and \(k^{*} \in \{1, 2, \ldots ,l\}\).
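As a concrete reading of Eqs. (6.29) and (6.30), the following minimal sketch stores a concept space as a dict mapping each (virtual) label to its sub-concept space, a list of (extent, intent) pairs. This representation and the function names are illustrative assumptions, not part of the original method.

```python
# Hypothetical representation: concept space = {label: [(extent, intent), ...]}.

def global_info(spaces, label):
    # w_{i-1,k*}: size of this sub-concept space relative to the whole space
    total = sum(len(sub) for sub in spaces.values())
    return len(spaces[label]) / total

def local_info(spaces, label, j):
    # z^{k*}_{i-1,j}: extent size of the j-th concept relative to its sub-space
    extent, _intent = spaces[label][j]
    return len(extent) / len(spaces[label])

spaces = {
    1: [({"x1", "x2"}, {"a", "b"}), ({"x3"}, {"b"})],
    2: [({"x4"}, {"c"}), ({"x5"}, {"a", "c"})],
}
print(global_info(spaces, 1))    # 2 of 4 concepts -> 0.5
print(local_info(spaces, 1, 0))  # |{x1, x2}| / 2 -> 1.0
```

The global weight grows with the size of a sub-concept space, while the local weight favors concepts with larger extents, which is exactly how both quantities later enter the factor I of Eq. (6.33).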

Definition 6.23

Let \(C=(\{x\},\mathcal {F}_{\Delta G^{k^{*}}_{i}}(x))\) be a newly input concept. For any concept \((X_{j},B_{j}) \in \mathcal {G}_{\mathcal {F}^{k^{*}}_{i-1},\mathcal {H}^{k^{*}}_{i-1}}\), where \(j \in \{ 1,2,\ldots ,|\mathcal {G}_{\mathcal {F}^{k^{*}}_{i-1},\mathcal {H}^{k^{*}}_{i-1}}|\}\), the concept similarity (CS) can be defined as:

$$\displaystyle \begin{aligned} \theta^{I}_{j}=I\frac{|A^{*} \cap B_{j}|}{|A^{*} \cap B_{j}|+2(\alpha|A^{*} - B_{j}|+(1 - \alpha)| B_{j} - A^{*}|)}, \end{aligned} $$
(6.33)

where \(I=1/(1+w_{i-1,k^{*}}\times e^{-z^{k^{*}}_{i-1,j}})\), \(A^{*}=\mathcal {F}_{\Delta G^{k^{*}}_{i}}(x)\) and α ∈ [0, 1].

In Eq. (6.33), I is set to 1 when the global and local information is not considered. In this case, Eq. (6.33) can further be formulated as

$$\displaystyle \begin{aligned} \theta_{j}=\frac{|A^{*} \cap B_{j}|}{|A^{*} \cap B_{j}|+2(\alpha|A^{*} - B_{j}|+(1 - \alpha)| B_{j} - A^{*}|)}. \end{aligned} $$
(6.34)

In Eq. (6.34), \(A^{*}-B_{j}\) represents the characteristics appearing in \(A^{*}\) but not in \(B_{j}\), and likewise for \(B_{j}-A^{*}\). Moreover, the parameters α and (1 − α) can be regarded as the weights attached to \(|A^{*}-B_{j}|\) and \(|B_{j}-A^{*}|\), respectively, which express the importance of the different features of \(A^{*}-B_{j}\) and \(B_{j}-A^{*}\) to the overall similarity degree. In fact, when α = 0.5, Eq. (6.34) degenerates to the Jaccard similarity [27, 65].
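As a sanity check on Eqs. (6.33) and (6.34), the sketch below implements the CS degree with Python sets standing in for the intents \(A^{*}\) and \(B_{j}\), and confirms that α = 0.5 recovers the Jaccard similarity; the set contents are made-up examples.

```python
import math

def concept_similarity(a_star, b, alpha=0.5):
    """CS degree of Eq. (6.34), with crisp sets standing in for intents."""
    inter = len(a_star & b)
    denom = inter + 2 * (alpha * len(a_star - b) + (1 - alpha) * len(b - a_star))
    return inter / denom if denom else 0.0

def with_info(theta, w, z):
    """theta^I of Eq. (6.33): scale theta by I = 1 / (1 + w * exp(-z))."""
    return theta / (1 + w * math.exp(-z))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

A = {"a1", "a2", "a3"}
B = {"a2", "a3", "a4", "a5"}
print(concept_similarity(A, B, alpha=0.5))  # 0.4
print(jaccard(A, B))                        # 0.4, identical at alpha = 0.5
```

Shifting α toward 1 penalizes features of \(A^{*}\) missing from \(B_{j}\) more heavily; shifting it toward 0 penalizes extra features of \(B_{j}\) instead.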

According to the sample separation axiom [79], for any instance there always exists a unique class that is most similar to it. Hence, given an instance x, the class vector is generated as follows: each sub-concept space first produces a set of CS degrees by computing the CS degree between the given concept and every concept in that sub-concept space. Then, the maximum CS degree (\(\widehat {\theta }^{I}_{j}\)) of each sub-concept space is obtained, namely, \(\widehat {\theta }^{I}_{j}=\max\limits _{j\in J} \{\theta ^{I}_{j}\}\), where \(J=\{ 1,2,\ldots ,|\mathcal {G}_{\mathcal {F}^{k^{*}}_{i-1},\mathcal {H}^{k^{*}}_{i-1}}|\}\). Finally, the estimated class distribution forms a maximum class vector \((\widehat {\theta }^{I}_{1},\widehat {\theta }^{I}_{2},\ldots ,\widehat {\theta }^{I}_{l})^{\mbox{T}}\). In the same manner, we can obtain an average class vector \((\overline {\theta }^{I}_{1},\overline {\theta }^{I}_{2},\ldots ,\overline {\theta }^{I}_{l})^{\mbox{T}}\).
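The class-vector construction can be sketched as follows. The dict representation of sub-concept spaces and the use of plain Jaccard as the CS degree are illustrative simplifications; any similarity, e.g. Eq. (6.34), can be plugged in.

```python
# Sketch: per sub-concept space, compute all CS degrees against the new
# concept's intent a_star, then keep the maximum and the average.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def class_vectors(a_star, spaces, sim=jaccard):
    labels = sorted(spaces)
    per_class = {k: [sim(a_star, intent) for _, intent in spaces[k]]
                 for k in labels}
    max_vec = [max(per_class[k]) for k in labels]
    avg_vec = [sum(per_class[k]) / len(per_class[k]) for k in labels]
    return max_vec, avg_vec

spaces = {
    1: [({"x1"}, {"a", "b"}), ({"x2"}, {"b", "c"})],
    2: [({"x3"}, {"c", "d"})],
}
max_vec, avg_vec = class_vectors({"a", "b"}, spaces)
print(max_vec)  # [1.0, 0.0]
```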

Note that an SSL method designed by combining the concept-cognitive process with the structural concept similarity \(\theta _{j}\) is referred to as a semi-supervised concept learning method, abbreviated S2CL for convenience. Meanwhile, an extended version of S2CL is further proposed that takes full advantage of the global and local conceptual information (i.e., the structural concept similarity \(\theta ^{I}_{j}\)) within a concept space. For conciseness, we write it as S2CLα when no confusion arises.

1.3.4 Theoretical Analysis

Essentially, α mainly reflects the influence of the different characteristics in the sets \(A^{*}-B_{j}\) and \(B_{j}-A^{*}\) on the overall concept similarity measure. Hence, it is important to discuss how to select an appropriate α for each dataset.

Let \(\mathcal {Y}=\{1,2,\ldots ,l\}\) be the label space. The concept spaces with different α r (α r ∈ [0, 1]) in the (i − 1)-th period can be formulated as

$$\displaystyle \begin{aligned} &\left[ \begin{array}{c} \mathcal{G}^{\alpha_{1}}_{\mathcal{F}_{i-1},\mathcal{H}_{i-1}}\\ \vdots \\ \mathcal{G}^{\alpha_{n}}_{\mathcal{F}_{i-1},\mathcal{H}_{i-1}}\\ \end{array} \right]=\left[ \begin{array}{ccc} \mathcal{G}^{\alpha_{1},1}_{\mathcal{F}_{i-1},\mathcal{H}_{i-1}} &\cdots &\mathcal{G}^{\alpha_{1},l}_{\mathcal{F}_{i-1},\mathcal{H}_{i-1}}\\ \vdots &\ddots &\vdots \\ \mathcal{G}^{\alpha_{n},1}_{\mathcal{F}_{i-1},\mathcal{H}_{i-1}} &\cdots &\mathcal{G}^{\alpha_{n},l}_{\mathcal{F}_{i-1},\mathcal{H}_{i-1}}\\ \end{array} \right], \end{aligned} $$
(6.35)

where \(\sum\limits _{r=1}^{n} \alpha _{r}=1\).

For an object x i, we can obtain its corresponding concept C i = ({x i}, B i). Then, based on Definition 6.23, we denote

$$\displaystyle \begin{aligned} Sim(C_{i},\mathcal{G}^{\alpha_{r},k}_{\mathcal{F}_{i-1},\mathcal{H}_{i-1}})=\{Sim(C_{i},C_{j}^{\alpha_{r}})\}_{j=1}^{m_{k}}=\{\theta^{I}_{j}\}_{j=1}^{m_{k}}, \end{aligned} $$
(6.36)

where \(C_{j}^{\alpha _{r}}\in \mathcal {G}^{\alpha _{r},k}_{\mathcal {F}_{i-1},\mathcal {H}_{i-1}} (k \in \mathcal {Y})\) and \(m_{k}=|\mathcal {G}^{\alpha _{r},k}_{\mathcal {F}_{i-1},\mathcal {H}_{i-1}}|\).

Combining Eqs. (6.35) with (6.36), the corresponding concept similarity in the (i − 1)-th state can be described as

$$\displaystyle \begin{aligned} &\left[ \begin{array}{c} S(C_{i},\mathcal{G}^{\alpha_{1}}_{i-1})\\ \vdots \\ S(C_{i},\mathcal{G}^{\alpha_{n}}_{i-1})\\ \end{array} \right]=\left[ \begin{array}{ccc} S(C_{i},\mathcal{G}^{\alpha_{1},1}_{i-1}) &\cdots &S(C_{i},\mathcal{G}^{\alpha_{1},l}_{i-1})\\ \vdots &\ddots &\vdots \\ S(C_{i},\mathcal{G}^{\alpha_{n},1}_{i-1}) &\cdots &S(C_{i},\mathcal{G}^{\alpha_{n},l}_{i-1})\\ \end{array} \right], \end{aligned} $$
(6.37)

where \(S(C_{i},\mathcal {G}^{\alpha _{r}}_{i-1})=Sim(C_{i},\mathcal {G}^{\alpha _{r}}_{\mathcal {F}_{i-1},\mathcal {H}_{i-1}})\ (r\in \{1,2,\ldots ,n\})\) and \(S(C_{i},\mathcal {G}^{\alpha _{r},k}_{i-1})=Sim(C_{i},\mathcal {G}^{\alpha _{r},k}_{\mathcal {F}_{i-1},\mathcal {H}_{i-1}})\).

Furthermore, inspired by [79], the category similarity function between the given concept C i and a class space \(\mathcal {G}^{\alpha _{r},k}_{\mathcal {F}_{i-1},\mathcal {H}_{i-1}}\) can be defined as

$$\displaystyle \begin{aligned} \phi_{Sim}(C_{i},\mathcal{G}^{\alpha_{r},k}_{\mathcal{F}_{i-1},\mathcal{H}_{i-1}})=\frac{|N^{\alpha_{r}}_{k}(C_{i})|}{K}, \end{aligned} $$
(6.38)

where \(N^{\alpha _{r}}_{k}(C_{i})=\{C_{j}| C_{j}\in \mathcal {G}^{\alpha _{r},k}_{\mathcal {F}_{i-1},\mathcal {H}_{i-1}} \wedge C_{j}\in N^{\alpha _{r}}_{K}(C_{i})\}\), and \(N^{\alpha _{r}}_{K}(C_{i})\) is a set of near neighbor instances related to x i under the parameter α r.

According to top-K set similarity [77], if \(\widehat {k}=\operatorname *{\arg \max }_{k\in \mathcal {Y}} \frac {|N^{\alpha _{r}}_{k}(C_{i})|}{K}\), then the instance x i is classified into the \(\widehat {k}\)-th class. Therefore, given the parameter K, the objective function can be formulated as

$$\displaystyle \begin{aligned} \widehat{\alpha}_{r}&=\operatorname*{\arg\min}_{\alpha_{r}\in[0, 1]} \sum_{i=1}^{m} \big(\frac{|N^{\alpha_{r}}_{k}(C_{i})|}{K}-y_{i}\big)^{2}\\ & \mbox{s.t.}\ \sum_{r=1}^{n} \alpha_{r}=1. \end{aligned} $$
(6.39)

In Eq. (6.39), our aim is to capture an optimal concept space with the concept structural information.
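The decision rule underlying Eqs. (6.38) and (6.39) can be sketched as follows: among the K concepts most similar to \(C_{i}\), count how many belong to each class and pick the class with the largest share. The pair-list input format and function name are illustrative assumptions.

```python
# Hedged sketch of the top-K rule: phi_Sim(C_i, class k) = |N_k(C_i)| / K,
# then argmax over classes k.

def classify_top_k(sims_with_labels, K):
    """sims_with_labels: list of (similarity, class_label) pairs for C_i."""
    top = sorted(sims_with_labels, key=lambda t: t[0], reverse=True)[:K]
    counts = {}
    for _, k in top:
        counts[k] = counts.get(k, 0) + 1
    return max(counts, key=counts.get)

sims = [(0.9, 1), (0.8, 2), (0.7, 1), (0.4, 2), (0.2, 2)]
print(classify_top_k(sims, K=3))  # class 1 holds 2 of the top 3
```

Optimizing α as in Eq. (6.39) then amounts to choosing the α that makes these class shares best match the known labels on the labeled portion of the data.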

1.3.5 Framework and Computational Complexity Analysis

For brevity, suppose there are three classes to predict. Figure 6.3 illustrates the whole procedure of S2CL. From a dataset (containing a small set of labeled data and a large amount of unlabeled data), we first obtain a corresponding regular formal decision context. Then, the initial concept spaces (a conditional concept space and its corresponding decision concept space) with concept structural information are constructed based on the cognitive operators. Specifically, the conditional concept space contains three sub-concept spaces, each composed of different concepts. As shown in the initial concept spaces stage of Fig. 6.3 (see the left of Fig. 6.3 for details), there are three sub-concept spaces corresponding to the three classes in the conditional concept space, and each sub-concept space contains two different types of concepts, namely object-oriented conditional concepts (indicated by red shapes in Fig. 6.3) and attribute-oriented conditional concepts (denoted by black shapes in Fig. 6.3). Meanwhile, each sub-concept space is also associated with a decision concept in the corresponding decision concept space, as shown in the first stage of Fig. 6.3. Third, any newly input unlabeled data are first used to form concepts, and the concept-cognitive process is then completed by concept recognition. Finally, given the parameter K, S2CL (or S2CLα) tries to learn an optimal concept space based on concept recognition and the concept-cognitive process under different parameters α r (r = 1, 2, …, n). In other words, the objective of S2CL (or S2CLα) is to seek an appropriate concept space that represents the underlying data distributions through the concept-cognitive process.

Fig. 6.3
figure 3

Illustration of the framework of the proposed methods. Considering that there are three classes to predict, the concept spaces with concept structural information will be constructed, which include three different sub-concept spaces

In the prediction stage, given an instance, the final concept space produces two estimates of the class distribution (a maximum class vector and an average class vector) by employing the CS degree \(\theta _{j}\) (or \(\theta ^{I}_{j}\)). The final CS degree vector is then obtained as the sum of the two 3-dimensional class vectors, and the class with the maximum value is output, as shown in Fig. 6.3.
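This prediction step amounts to an element-wise sum of the two class vectors followed by an argmax; a minimal sketch with made-up class vectors:

```python
# Sum the maximum and average class vectors, output the class with the
# largest combined CS degree (classes numbered from 1).

def predict(max_vec, avg_vec):
    combined = [m + a for m, a in zip(max_vec, avg_vec)]
    return combined.index(max(combined)) + 1

print(predict([0.9, 0.5, 0.3], [0.6, 0.4, 0.2]))  # class 1
```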

Based on the above discussion, we are ready to present the corresponding algorithm of S2CL (see Algorithm 6.10 for details). In Algorithm 6.10, Step 3 generates the initial concept spaces; the concept recognition and concept-cognitive process are then conducted by Steps 4–8; finally, the prediction is completed by Steps 9–12. In Steps 9–12, if the predicted value \(\widehat {k}\) is consistent with the ground-truth label, then the prediction of S2CL is correct. Formally, the accuracy on a test dataset T can be described as \(acc=\frac {N}{|T|}\), where N denotes the number of correct predictions. Similarly, the algorithm of S2CLα is easily obtained by replacing the structural concept similarity \(\theta _{j}\) with \(\theta ^{I}_{j}\) in Step 6 of Algorithm 6.10.

The time complexity of S2CL is mainly composed of two parts, i.e., constructing the initial concept spaces and performing the concept-cognitive process with concept structural information. Let the time complexities of constructing a concept, computing the CS degree, and updating the concept space be \(O(t_{1})\), \(O(t_{2})\), and \(O(t_{3})\), respectively. Then it is easy to verify that the time complexity of Step 3 is \(O(t_{1}|S_{L}|(|M|+|D|))\), and the complexity of accomplishing the concept-cognitive process by concept recognition is \(O(|S_{U}|(t_{1}+t_{2}+t_{3}))\). Note that CCL is an incremental learning process, as the proposed method is updated by inputting objects one by one. Therefore, S2CL can also be regarded as an incremental method for SSL in dynamic environments. For convenience, let E be the number of incremental learning steps and C the sample size of each incremental learning step (instances randomly selected from \(S_{U}\)). Thus, the time complexity of incremental learning (see Algorithm 6.11 for details) is \(O(E(|S_{U}|(t_{1}+t_{2}+t_{3}))+|T|)\).

Algorithm 6.10 S2CL algorithm

Algorithm 6.11 Incremental learning

1.4 Fuzzy-Based Concept Learning Method: Exploiting Data with Fuzzy Conceptual Clustering

1.4.1 Preliminaries

In this subsection, we review some notions related to the fuzzy formal decision context.

In a classical formal decision context, the conditional attributes are discrete. However, in the real world, many tasks (e.g., classification, image segmentation, etc.) are described with numerical (or fuzzy) data, which means that classical formal decision contexts cannot cope with them directly. Therefore, a fuzzy formal decision context is proposed based on fuzzy sets [81].

Let G be a universe of discourse. A fuzzy set \(\widetilde {X}\) on G can be defined as follows:

$$\displaystyle \begin{aligned} \widetilde{X}=\{<x,\mu_{\widetilde{X}}(x)>|x\in G\}, \end{aligned}$$

where \(\mu _{\widetilde {X}}:G\rightarrow [0,1]\), and \(\mu _{\widetilde {X}}(x)\) is referred to as the membership degree of the object x ∈ G to \(\widetilde {X}\). We denote by \(L^{G}\) the set of all fuzzy sets on G.

Definition 6.24 ([83])

A fuzzy formal context \((G,M,\widetilde {I})\) is a triple, where G is a set of objects, M is a set of attributes, and \(\widetilde {I}\) is a fuzzy relation between G and M. Each pair \((x,a)\in G\times M\) has a membership degree \(\mu _{\widetilde {I}}(x,a)\) in [0, 1], and we write \(\widetilde {I}(x,a)=\mu _{\widetilde {I}}(x,a)\) for the sake of convenience.

Definition 6.25 ([4, 78, 83])

Let \((G,M,\widetilde {I})\) be a fuzzy formal context. For X ⊆ G and \(\widetilde {B} \in L^{M}\), the operators \((\cdot )^{\ast }\) are defined as follows:

$$\displaystyle \begin{aligned} &X^{\ast}(a)=\bigwedge\limits_{x\in X} \widetilde{I}(x,a), a\in M,\\ &\widetilde{B}^{\ast}=\{x\in G|\forall a\in M, \widetilde{B}(a) \leq \widetilde{I}(x,a)\}. \end{aligned} $$
(6.40)

Then, we say that a pair \((X,\widetilde {B})\) is a fuzzy concept of a fuzzy formal context \((G,M,\widetilde {I})\) if \(X^{\ast }=\widetilde {B}\) and \(\widetilde {B}^{\ast }=X\); here X and \(\widetilde {B}\) are respectively known as the extent and intent of the fuzzy concept \((X,\widetilde {B})\). For convenience, the set of all fuzzy concepts is denoted by \(L(G,M,\widetilde {I})\). In [83], \(L(G,M,\widetilde {I})\) is called a special crisp-fuzzy variable threshold concept lattice when the threshold is set to 1. For \((X_{1},\widetilde {B}_{1}),(X_{2},\widetilde {B}_{2}) \in L(G,M,\widetilde {I})\), we define the order relation \((X_{1},\widetilde {B}_{1})\leq (X_{2},\widetilde {B}_{2})\) if and only if \(X_{1} \subseteq X_{2}\) (or \(\widetilde {B}_{2}\subseteq \widetilde {B}_{1}\)). Then we say that \((X_{1},\widetilde {B}_{1})\) is a sub-concept of \((X_{2},\widetilde {B}_{2})\) and \((X_{2},\widetilde {B}_{2})\) is a super-concept of \((X_{1},\widetilde {B}_{1})\).
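A minimal sketch of the derivation operators of Eq. (6.40), with the fuzzy relation \(\widetilde {I}\) stored as a nested dict of membership degrees; the function names and the data are illustrative, not from the original paper.

```python
# X*(a) is the minimum membership over X; B~* collects the objects that
# dominate B~ on every attribute, per Eq. (6.40).

def extent_to_intent(X, I, M):
    """X*(a) = min over x in X of I(x, a), for each attribute a."""
    return {a: min(I[x][a] for x in X) for a in M}

def intent_to_extent(B, I, G):
    """B~* = {x in G | B(a) <= I(x, a) for all a}."""
    return {x for x in G if all(B[a] <= I[x][a] for a in B)}

I = {
    "x1": {"a": 0.8, "b": 0.6},
    "x2": {"a": 0.9, "b": 0.4},
    "x3": {"a": 0.3, "b": 0.7},
}
G, M = set(I), {"a", "b"}
B = extent_to_intent({"x1", "x2"}, I, M)  # {'a': 0.8, 'b': 0.4}
X = intent_to_extent(B, I, G)             # {'x1', 'x2'}
print(B, X)
```

Here \((X,\widetilde {B})\) with X = {x1, x2} is a fuzzy concept: applying the two operators in turn reproduces the same pair.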

Definition 6.26 ([55])

Let \((G,M,\widetilde {I})\) and \((G,D,\widetilde {J})\) be two fuzzy formal contexts, where \(\widetilde {I}:G\times M \rightarrow [0,1]\) and \(\widetilde {J}:G\times D \rightarrow [0,1]\). Then \((G,M,\widetilde {I},D,\widetilde {J})\) is referred to as a fuzzy formal decision context, where M ∩ D = ∅, and M and D are the conditional and decision attribute sets, respectively.

Note that a quintuple \((G,M,I,D,\widetilde {J})\) is called a crisp-fuzzy formal decision context in [48], where (G, M, I) and \((G,D,\widetilde {J})\) are respectively a classical formal context and fuzzy formal context.

1.4.2 Fuzzy Concept Learning Method

In this subsection, we first show some new notions and properties for the proposed FCLM, which includes a regular fuzzy formal decision context, an object-oriented fuzzy conceptual clustering, and the related theoretical analysis. Based on them, we further present the detailed procedure of FCLM.

1.4.2.1 A. Regular Fuzzy Formal Decision Context

According to Definition 6.26, Fig. 6.4a and b represent two different fuzzy formal decision contexts \((G,M,\widetilde {I},D,\widetilde {J})\) and \((G,M,I,D,\widetilde {J})\), respectively. More precisely, Fig. 6.4a shows a fuzzy formal decision context in which M and D are both numerical, while Fig. 6.4b shows one in which M is discrete and D is numerical. However, in real applications, most original data are presented in the form of Fig. 6.4c; that is, the decision attribute set D is described with discrete label information while the conditional attribute set M consists of fuzzy data.

Fig. 6.4
figure 4

Illustration of three different forms of fuzzy formal decision contexts

Definition 6.27

Let \((G,M,\widetilde {I})\) be a fuzzy formal context and (G, D, J) be a classical formal context. Then the quintuple \((G,M,\widetilde {I},D,J)\) is known as a fuzzy-crisp formal decision context, where \(\widetilde {I}:G\times M \rightarrow [0,1]\) and J : G × D →{0, 1}.

Definition 6.28

Let \((G,M,\widetilde {I},D,J)\) be a fuzzy-crisp formal decision context. For any k 1, k 2 ∈ D, if \(\mathcal {H}^{d}(k_{1})\cap \mathcal {H}^{d}(k_{2})=\emptyset \), then we say that \((G,M,\widetilde {I},D,J)\) is a regular fuzzy-crisp formal decision context.
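The regularity condition of Definition 6.28 simply requires the object sets \(\mathcal {H}^{d}(k)\) of distinct decision attributes (class labels) to be pairwise disjoint; a minimal sketch, with the dict representation as an assumption:

```python
# Regular fuzzy-crisp formal decision context: distinct labels have
# disjoint decision extents H^d(k).

def is_regular(decision_extents):
    """decision_extents: dict mapping each k in D to H^d(k), a set of objects."""
    labels = list(decision_extents)
    return all(
        decision_extents[labels[i]].isdisjoint(decision_extents[labels[j]])
        for i in range(len(labels)) for j in range(i + 1, len(labels))
    )

print(is_regular({"k1": {"x1", "x2"}, "k2": {"x3"}}))        # True
print(is_regular({"k1": {"x1", "x2"}, "k2": {"x2", "x3"}}))  # False: x2 shared
```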

Generally speaking, constructing a fuzzy concept lattice from a standard fuzzy context is quite complicated, since in the worst case it takes exponential time. Hence, granular computing (GrC) should be introduced into the process of generating fuzzy concept lattices to greatly reduce the amount of calculation.

Let \((G,M,\widetilde {I})\) be a fuzzy formal context, and let \(\widetilde {\mathcal {F}}^{c}:2^{G}\rightarrow L^{M}\) and \(\widetilde {\mathcal {H}}^{c}:L^{M}\rightarrow 2^{G}\) be two mappings. Then \(X^{\ast }(a)\) and \(\widetilde {B}^{\ast }\) in Definition 6.25 can be rewritten as \(\widetilde {\mathcal {F}}^{c}(X)(a)\) and \(\widetilde {\mathcal {H}}^{c}(\widetilde {B})\), respectively. In particular, for an object set \(\{x\}\ (x \in G)\), \(\widetilde {\mathcal {F}}^{c}(\{x\})(a)\) is abbreviated as \(\widetilde {\mathcal {F}}^{c}(x)\) for brevity.

Definition 6.29

Let \((G,M,\widetilde {I},D,J)\) be a fuzzy-crisp formal decision context, and \(\widetilde {\mathcal {F}}^{c}:2^{G}\rightarrow L^{M}\),\(\widetilde {\mathcal {H}}^{c}:L^{M}\rightarrow 2^{G}\) and \(\mathcal {F}^{d}:2^{G}\rightarrow 2^{D}\),\(\mathcal {H}^{d}:2^{D}\rightarrow 2^{G}\) be four mappings. For any x ∈ G, \((\widetilde {\mathcal {H}}^{c}\widetilde {\mathcal {F}}^{c}(x),\widetilde {\mathcal {F}}^{c}(x))\) and \((\mathcal {H}^{d}\mathcal {F}^{d}(x),\mathcal {F}^{d}(x))\) are called a fuzzy conditional granular concept and classical decision granular concept, respectively. The sets of all fuzzy conditional granular concepts and classical decision granular concepts are respectively represented as follows:

$$\displaystyle \begin{aligned} &\mathcal{G}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}=\{(\widetilde{\mathcal{H}}^{c}\widetilde{\mathcal{F}}^{c}(x),\widetilde{\mathcal{F}}^{c}(x))|x\in G\},\\ &\mathcal{G}_{\mathcal{F}^{d}\mathcal{H}^{d}}=\{(\mathcal{H}^{d}\mathcal{F}^{d}(x),\mathcal{F}^{d}(x))|x\in G\}, \end{aligned} $$

where \(\mathcal {G}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) and \(\mathcal {G}_{\mathcal {F}^{d}\mathcal {H}^{d}}\) are respectively referred to as the fuzzy conditional concept space and classical decision concept space.

It should be pointed out that the fuzzy concept lattice performs well on classification but is very time-consuming, mainly because it may contain many redundant elements. So, as with the classical concept lattice, it is better to replace the fuzzy concept lattice with a fuzzy concept space (containing only part of the elements of the fuzzy concept lattice) for classification tasks, with the purpose of improving learning efficiency.

Property 6.5

Let \((G,M,\widetilde {I},D,J)\) be a fuzzy-crisp formal decision context. For any \((X_{1},\widetilde {B})\in \mathcal {G}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) and \((X_{2},K)\in \mathcal {G}_{\mathcal {F}^{d}\mathcal {H}^{d}}\), if X 1 ⊆ X 2, and \(X_{1},\widetilde {B},X_{2}\) and K are nonempty, then the object set X 1 is connected with the decision attribute set K under the conditional attribute set \(\widetilde {B}\).

Proof

The proof is immediate from Definition 6.3 and Property 6.2. □

From Definition 6.29 and Property 6.5, we know that an object can also be connected with a label in a fuzzy-crisp formal decision context.

Based on the above discussion, the complete algorithm of constructing two concept spaces (including a fuzzy conditional concept space and classical decision concept space) is presented in Algorithm 6.12.

Algorithm 6.12 Constructing two concept spaces

1.4.2.2 B. Object-Oriented Fuzzy Conceptual Clustering

In order to generate fuzzy ontologies, a fuzzy conceptual clustering [50] was adopted in [67]. In fact, it was based on a crisp-crisp variable threshold concept lattice and implemented conceptual clustering via fuzzy sets intersection and union. However, to adapt to granular concepts based on a crisp-fuzzy variable threshold concept lattice, we need to consider the following notions.

Let \((G,M,\widetilde {I})\) be a fuzzy formal context. For any \((X,\widetilde {B}) \in \mathcal {G}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\), |X| is called the object-oriented cardinality with reference to \((X,\widetilde {B})\).

Definition 6.30

Let \((X_{j},\widetilde {B}_{j})\) be a fuzzy granular concept and \((X_{i},\widetilde {B}_{i})\) be its sub-concept, then the object-oriented fuzzy concept similarity (object-oriented FCS) is defined as follows:

$$\displaystyle \begin{aligned} \theta ^{o}=C^{O}(X_{i},X_{j})=\frac{|X_{i}\bigcap X_{j}|}{|X_{i}\bigcup X_{j}|}. \end{aligned} $$
(6.41)

Definition 6.31

Let \((X_{j},\widetilde {B}_{j})\) and \((X_{l},\widetilde {B}_{l})\) be two fuzzy granular concepts, then the attribute-oriented fuzzy concept similarity (attribute-oriented FCS) is defined as follows:

$$\displaystyle \begin{aligned} \theta ^{a}=C^{A}(\widetilde{B}_{j},\widetilde{B}_{l})=||\widetilde{B}_{j}-\widetilde{B}_{l}||{}_{2}^{2}. \end{aligned} $$
(6.42)
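As a quick illustration (not from the original paper), the two similarity measures can be computed directly: Eq. (6.41) is the Jaccard index of the two extents, while Eq. (6.42) is a squared Euclidean distance between intents, so smaller values of θ a indicate more similar concepts.

```python
import numpy as np

def object_oriented_fcs(X_i, X_j):
    """Eq. (6.41): Jaccard index of the two extents (sets of objects)."""
    return len(X_i & X_j) / len(X_i | X_j)

def attribute_oriented_fcs(B_j, B_l):
    """Eq. (6.42): squared L2 distance between the two fuzzy intents."""
    d = np.asarray(B_j) - np.asarray(B_l)
    return float(d @ d)

theta_o = object_oriented_fcs({0, 1}, {0, 1, 2})          # 2 common / 3 total
theta_a = attribute_oriented_fcs([0.8, 0.2], [0.5, 0.2])  # (0.8 - 0.5)^2
```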

Definition 6.32

Let \(\mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) be a sub-concept space of \(\mathcal {G}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\). For any \((X_{i},\widetilde {B}_{i}) \in \mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\), we say that \(\mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) is an object-oriented conceptual cluster of the concept space with an object-oriented FCS threshold λ if the following properties hold:

  1. There exists a supremum concept \((X_{p},\widetilde {B}_{p})\in \mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) that is not similar to any of its super-concepts.

  2. There exists at least one super-concept \((X_{j},\widetilde {B}_{j}) \in \mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) such that C O(X i, X j) > λ when X i ≠ X p.

  3. Any fuzzy concept \((X_{i},\widetilde {B}_{i})\) only belongs to one object-oriented conceptual cluster \(\mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\).

Definition 6.33

Let \(\mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) be an object-oriented conceptual cluster. For \((X_{1},\widetilde {B}_{1}), (X_{2},\widetilde {B}_{2}),\dots ,(X_{p},\widetilde {B}_{p})\in \mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\ (p=|\mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}|)\), let \(X_{S_{\lambda }}=\bigcup\limits _{i=1}^{p}X_{i}\) and \(\widetilde {B}_{S_{\lambda }}=(\widetilde {B}_{S_{\lambda }}(a_{1}),\widetilde {B}_{S_{\lambda }}(a_{2}),\dots ,\widetilde {B}_{S_{\lambda }}(a_{|M|}))\), where \(\widetilde {B}_{S_{\lambda }}(a_{j})=\frac {1}{p}\sum\limits _{i=1}^{p}\widetilde {B}_{i}(a_{j})\ (j\in \{1,2,\dots ,|M|\})\). Then we say that the crisp-fuzzy pair \((X_{S_{\lambda }}, \widetilde {B}_{S_{\lambda }})\) is a pseudo concept induced by the object-oriented conceptual cluster \(\mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\).

In what follows, the pseudo concept \((X_{S_{\lambda }},\widetilde {B}_{S_{\lambda }})\) is called the representation of the object-oriented conceptual cluster \(\mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\). Note that the process of generating a new pseudo concept is known as concept generation. Hereinafter, we do not distinguish pseudo concepts from fuzzy concepts, since pseudo concepts are only intermediate variables in the subsequent fuzzy conceptual clustering. In other words, we sometimes also refer to pseudo concepts as fuzzy concepts when no confusion can arise.

Statistically speaking, Definition 6.33 can completely characterize a new fuzzy concept. However, in cognitive science, concept cognition is often considered to be incremental, due to individual cognitive limitations and incomplete cognitive environments. Motivated by this observation, the process of constructing a new fuzzy concept can be rephrased as follows.

Definition 6.34

Let \((X_{p},\widetilde {B}_{p})\) be the supremum concept of \(\mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\). For \((X_{1},\widetilde {B}_{1}), (X_{2},\widetilde {B}_{2}),\dots ,(X_{p},\widetilde {B}_{p})\in \mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\), each dimension of the intent of a new fuzzy concept \((X_{S_{\lambda }},\widetilde {B}_{S_{\lambda }})\) can be rewritten as follows:

$$\displaystyle \begin{aligned} \widetilde{B}_{S_{\lambda}}(a_{j})=&\frac{1}{2^{p-1}}(\widetilde{B}_{1}(a_{j})+\widetilde{B}_{2}(a_{j})+2\widetilde{B}_{3}(a_{j})+\\ &4\widetilde{B}_{4}(a_{j})+\dots+2^{p-2}\widetilde{B}_{p}(a_{j})), \end{aligned} $$
(6.43)

where j ∈{1, 2, …, |M|}.
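A small sketch of Eq. (6.43), with hypothetical intent vectors: the weights \((1, 1, 2, \dots, 2^{p-2})/2^{p-1}\) sum to 1, the supremum concept always receives weight 1/2, and the result respects the bound in Eq. (6.44).

```python
import numpy as np

def incremental_intent(B):
    """Eq. (6.43): B is a (p, |M|) array of intents, ordered from the
    sub-concept B_1 up to the supremum concept B_p (last row)."""
    p = B.shape[0]
    w = np.array([1] + [2 ** max(i - 1, 0) for i in range(1, p)], dtype=float)
    w /= 2 ** (p - 1)     # weights [1, 1, 2, 4, ..., 2^{p-2}] / 2^{p-1}, sum to 1
    return w @ B

B = np.array([[0.2, 0.6],
              [0.4, 0.8],
              [0.9, 1.0]])  # p = 3; supremum concept is the last row
intent = incremental_intent(B)  # weights (1/4, 1/4, 1/2) -> [0.6, 0.85]
```

Since the supremum concept's weight is exactly 1/2 and all memberships lie in [0, 1], the bound \(\widetilde{B}_{p}(a_{j})/2 \leq \widetilde{B}_{S_{\lambda}}(a_{j}) \leq 1\) of Eq. (6.44) holds by construction.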

Theorem 6.4

Let \(\widetilde {B}_{S_{\lambda }}(a_{j})\) be any dimension of the intent of a new fuzzy concept \((X_{S_{\lambda }},\widetilde {B}_{S_{\lambda }})\). Then we have

$$\displaystyle \begin{aligned} \frac{\widetilde{B}_{p}(a_{j})}{2} \leq \widetilde{B}_{S_{\lambda}}(a_{j}) \leq 1. \end{aligned} $$
(6.44)

Proof

It is immediate from Definitions 6.24 and 6.34. □

For any \((X_{i},\widetilde {B}_{i}), (X_{j},\widetilde {B}_{j})\in \mathcal {G}^{S_{\lambda }}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\), if \((X_{j},\widetilde {B}_{j})\) is a super-concept of \((X_{i},\widetilde {B}_{i})\), we say that \((X_{j},\widetilde {B}_{j})\) presents a stronger conceptual representation ability than \((X_{i},\widetilde {B}_{i})\). Equation (6.43) describes the process of incremental cognition for concept formation by means of the hierarchical relations between sub-concepts and super-concepts, where the coefficient of each dimension is heightened along with the increase of conceptual representation ability. Equation (6.44) shows that the supremum concept has a great influence on the process of constructing new fuzzy concepts.

Definition 6.35

Let \(\mathcal {G}^{S_{\lambda ,1}}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}},\mathcal {G}^{S_{\lambda ,2}}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}},\dots , \mathcal {G}^{S_{\lambda ,m}}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) be a partition of \(\mathcal {G}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) with an object-oriented FCS threshold λ. Then a new concept space can be defined as follows:

$$\displaystyle \begin{aligned} \mathcal{G}^{S_{\lambda,*}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}=\bigcup\limits_{i=1}^{m}\mathcal{G}^{S_{\lambda,i}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}} =\bigcup\limits_{i=1}^{m}(X_{S_{\lambda,i}},\widetilde{B}_{S_{\lambda,i}}). \end{aligned} $$
(6.45)

Theorem 6.5

Let \(\mathcal {G}^{S_{\lambda ,*}}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) be a concept space with an object-oriented FCS threshold λ. We have

$$\displaystyle \begin{aligned} 1 \leq |\mathcal{G}^{S_{\lambda,*}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}| \leq |\mathcal{G}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}|. \end{aligned} $$
(6.46)

Proof

The proof of Theorem 6.5 can be found in the original paper [43]. □

Based on the above theory, the procedure of object-oriented fuzzy conceptual clustering is summarized in Algorithm 6.13.

Algorithm 6.13 Object-oriented fuzzy conceptual clustering method

1.4.3 Theoretical Analysis

From Definition 6.35 and Theorem 6.5, we know that the object-oriented FCS threshold has a significant impact on the construction of a new concept space. Hence, it is necessary to select an optimal (or approximately optimal) λ for each dataset.

Let λ = λ(i) (i ∈{1, 2, …, n}), and λ(i) ∝ i. For all the newly constructed concept spaces with different λ(i), we denote

$$\displaystyle \begin{aligned} \!\!\left[ \begin{array}{c} \mathcal{G}^{S_{\lambda(1),*}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}\\ \mathcal{G}^{S_{\lambda(2),*}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}\\ \vdots \\ \mathcal{G}^{S_{\lambda(n),*}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}\\ \end{array} \right]\!\!\!=\!\!\! \left[ \begin{array}{cccc} \mathcal{G}^{S_{\lambda(1),1}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}} &\mathcal{G}^{S_{\lambda(1),2}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}} &\cdots &\mathcal{G}^{S_{\lambda(1),m_{1}}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}\\ \mathcal{G}^{S_{\lambda(2),1}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}} &\mathcal{G}^{S_{\lambda(2),2}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}} &\cdots &\mathcal{G}^{S_{\lambda(2),m_{2}}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}\\ \vdots &\vdots &\ddots &\vdots \\ \mathcal{G}^{S_{\lambda(n),1}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}} &\mathcal{G}^{S_{\lambda(n),2}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}} &\cdots &\mathcal{G}^{S_{\lambda(n),m_{n}}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}\\ \end{array} \right], \end{aligned} $$
(6.47)

where \(m_{i}=|\mathcal {G}^{S_{\lambda (i),*}}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}|\), and \(\mathcal {G}^{S_{\lambda (i),*}}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) is computed with λ(i).

In Eq. (6.47), we say that \(\mathcal {G}^{S_{\lambda (i),j}}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}(j\in \{1,2,\dots ,m_{i}\})\) is a conceptual subcluster of \(\mathcal {G}^{S_{\lambda (i),*}}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\). Meanwhile, according to Definition 6.33, each object-oriented conceptual cluster can be represented as a new fuzzy concept. Hence, Eq. (6.47) can be rewritten as Eq. (6.48).

Note that only the fuzzy conditional concept space \(\mathcal {G}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) is influenced by the object-oriented FCS threshold λ(i). The concept space \(\mathcal {G}^{S_{\lambda (i),*}}_{\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}}\) can be simplified by omitting the suffix \(\widetilde {\mathcal {F}}^{c}\widetilde {\mathcal {H}}^{c}\) when no confusion exists, namely \(\mathcal {G}^{S_{\lambda (i),*}}\).

Property 6.6

Let \(\mathcal {G}^{S_{\lambda (i),*}}\) be a set of object-oriented conceptual clusters with the object-oriented FCS threshold λ(i). Then we have

$$\displaystyle \begin{aligned} \left[ \begin{array}{c} \mathcal{G}^{S_{\lambda(1),*}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}\\ \mathcal{G}^{S_{\lambda(2),*}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}\\ \vdots \\ \mathcal{G}^{S_{\lambda(n),*}}_{\widetilde{\mathcal{F}}^{c}\widetilde{\mathcal{H}}^{c}}\\ \end{array} \right]= \left[ \begin{array}{cccc} \big(X_{S_{\lambda(1),1}},\widetilde{B}_{S_{\lambda(1),1}}\big) &\big(X_{S_{\lambda(1),2}},\widetilde{B}_{S_{\lambda(1),2}}\big) &\cdots &\big(X_{S_{\lambda(1),m_{1}}},\widetilde{B}_{S_{\lambda(1),m_{1}}}\big) \\ \big(X_{S_{\lambda(2),1}},\widetilde{B}_{S_{\lambda(2),1}}\big) &\big(X_{S_{\lambda(2),2}},\widetilde{B}_{S_{\lambda(2),2}}\big) &\cdots &\big(X_{S_{\lambda(2),m_{2}}},\widetilde{B}_{S_{\lambda(2),m_{2}}}\big) \\ \vdots &\vdots &\ddots &\vdots \\ \big(X_{S_{\lambda(n),1}},\widetilde{B}_{S_{\lambda(n),1}}\big) &\big(X_{S_{\lambda(n),2}},\widetilde{B}_{S_{\lambda(n),2}}\big) &\cdots &\big(X_{S_{\lambda(n),m_{n}}},\widetilde{B}_{S_{\lambda(n),m_{n}}}\big) \\ \end{array} \right]. \end{aligned} $$
(6.48)
$$\displaystyle \begin{aligned} |\mathcal{G}^{S_{\lambda(i),*}}| \propto \lambda(i). \end{aligned} $$
(6.49)

Proof

The proof can be derived by setting λ(i) = λ and applying Definition 6.35. □

In the above discussion, we only considered the situation where there exists one concept cluster in FCLM. However, in the real world, studying the situation of multiple concept clusters with label information is also highly desirable, as there are at least two concept clusters in classification tasks.

We denote by G = {x 1, x 2, …, x m} a set of instances and \(\mathcal {K}=\{1,2,\dots ,l\}\) the label space. There exists a partition of the instances into l clusters \(\mathcal {C}_{1},\mathcal {C}_{2},\dots ,\mathcal {C}_{l}\) by means of the label information such that they cover all the instances; formally, \(\mathcal {C}_{1}\cup \mathcal {C}_{2}\cup \dots \cup \mathcal {C}_{l}=G\), where \(\mathcal {C}_{i}\cap \mathcal {C}_{j}=\emptyset \ (\forall i\neq j)\). Meanwhile, we denote the corresponding fuzzy conceptual clusters with λ(i) by \(\mathcal {G}^{S_{\lambda (i),*}}_{1},\mathcal {G}^{S_{\lambda (i),*}}_{2},\dots ,\mathcal {G}^{S_{\lambda (i),*}}_{l}\). Moreover, the set of all fuzzy conceptual clusters with λ(i) is denoted by \(\mathcal {C}^{S_{\lambda (i)}}\), namely \(\mathcal {C}^{S_{\lambda (i)}}=\{\mathcal {G}^{S_{\lambda (i),*}}_{1},\mathcal {G}^{S_{\lambda (i),*}}_{2},\cdots ,\mathcal {G}^{S_{\lambda (i),*}}_{l}\}\). For different object-oriented FCS thresholds, we further denote

$$\displaystyle \begin{aligned} \left[ \begin{array}{c} \mathcal{C}^{S_{\lambda(1)}}\\ \mathcal{C}^{S_{\lambda(2)}}\\ \vdots \\ \mathcal{C}^{S_{\lambda(n)}}\\ \end{array} \right]\!\!\!=\!\!\! \left[ \begin{array}{cccc} \mathcal{G}^{S_{\lambda(1),*}}_{1} &\mathcal{G}^{S_{\lambda(1),*}}_{2} &\cdots &\mathcal{G}^{S_{\lambda(1),*}}_{l}\\ \mathcal{G}^{S_{\lambda(2),*}}_{1} &\mathcal{G}^{S_{\lambda(2),*}}_{2} &\cdots &\mathcal{G}^{S_{\lambda(2),*}}_{l}\\ \vdots &\vdots &\ddots &\vdots \\ \mathcal{G}^{S_{\lambda(n),*}}_{1} &\mathcal{G}^{S_{\lambda(n),*}}_{2} &\cdots &\mathcal{G}^{S_{\lambda(n),*}}_{l}\\ \end{array} \right]. \end{aligned} $$
(6.50)

Our aim is to select an optimal λ(i) in the interval [0,1] for each dataset. Let \((X_{r},\widetilde {B}_{r})\ (r\in \{1,2,\dots ,m\})\) be a fuzzy granular concept. Then, the objective function can be formulated as

$$\displaystyle \begin{aligned} E(\lambda(i),j)&=\min\limits_{i\in\mathcal{I},j\in\mathcal{J},k^{\prime}}\sum_{r=1}^{m}||(X_{r},\widetilde{B}_{r})-\mathcal{G}^{S_{\lambda(i),j}}_{k^{\prime}}||{}_{2}^{2}-\\ &\max\limits_{i\in\mathcal{I}}\min\limits_{j\in\mathcal{J}}\sum_{k^{\prime\prime}\in\overline{\mathcal{K}}}\sum_{r=1}^{m}||(X_{r},\widetilde{B}_{r})-\mathcal{G}^{S_{\lambda(i),j}}_{k^{\prime\prime}}||{}_{2}^{2}\\ &\mbox{s.t.}\ \ m_{i}\propto \lambda(i), 0\leq \lambda(i)\leq 1, \end{aligned} $$
(6.51)

where \(\mathcal {I}=\{1,2,\dots ,n\}\), \(\mathcal {J}=\{1,2,\dots ,m_{i}\}\), \(\overline {\mathcal {K}}=\mathcal {K}\setminus \{k^{\prime }\}\), and k′ represents the real class label of the fuzzy granular concept \((X_{r},\widetilde {B}_{r})\). Hence, in Eq. (6.51), the first term denotes that samples are classified into the ground-truth conceptual subcluster, while the second term indicates the opposite situation.

Let \((X_{S_{\lambda (i),j}}^{k},\widetilde {B}_{S_{\lambda (i),j}}^{k})\ (k\in \mathcal {K})\) be the representation of the conceptual subcluster \(\mathcal {G}^{S_{\lambda (i),j}}_{k}\). Any fuzzy granular concept \((X_{r},\widetilde {B}_{r})\) can be considered as an instance x r with |M|-dimensional features. Therefore, according to Definition 6.31 and Eq. (6.48), the objective function can be reformulated as

$$\displaystyle \begin{aligned} E(\lambda(i),j)&=\min\limits_{i\in\mathcal{I},j\in\mathcal{J},k^{\prime}}\sum_{r=1}^{m}||\widetilde{B}_{r}-\widetilde{B}_{S_{\lambda(i),j}}^{k^{\prime}}||{}_{2}^{2}-\\ &\max\limits_{i\in\mathcal{I}}\min\limits_{j\in\mathcal{J}}\sum_{k^{\prime\prime}\in\overline{\mathcal{K}}}\sum_{r=1}^{m}||\widetilde{B}_{r}-\widetilde{B}_{S_{\lambda(i),j}}^{k^{\prime\prime}}||{}_{2}^{2}\\ &\mbox{s.t.}\ \ m_{i}\propto \lambda(i), 0\leq \lambda(i)\leq 1. \end{aligned} $$
(6.52)

Based on Eq. (6.48) and Property 6.6, we know that the variable j is dependent on another variable λ(i). Hence, we can optimize the objective function of our FCLM by means of updating λ(i):

$$\displaystyle \begin{aligned} \widehat{\lambda}(i)=\operatorname*{\arg\min}_{i\in\mathcal{I},j\in\mathcal{J}}E(\lambda(i),j) \\ \mbox{s.t.}\ \ m_{i}\propto \lambda(i), 0\leq \lambda(i)\leq 1. \end{aligned} $$
(6.53)

In theory, we can obtain an optimal \(\widehat {\lambda }(i)\) by solving Eq. (6.53) directly. Unfortunately, it is quite difficult to obtain analytical solutions due to the lack of a concrete functional expression between m i and λ(i). Hence, we select an approximately optimal \(\widehat {\lambda }(i)\) by a method similar to grid search. The complete procedure for selecting an approximately optimal solution is presented in Algorithm 6.14.
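The grid-search selection of \(\widehat {\lambda }(i)\) can be sketched as follows. Here `objective` is a hypothetical stand-in for the score E(λ(i), j) of Eq. (6.52), which in practice requires rebuilding the conceptual clusters for every candidate threshold; the grid step is also an illustrative choice.

```python
import numpy as np

def select_lambda(objective, grid=None):
    """Approximate argmin of Eq. (6.53) by grid search over [0, 1].
    `objective` maps a candidate threshold lambda to a scalar to minimize."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)          # candidate thresholds, step 0.05
    scores = [objective(lam) for lam in grid]
    return float(grid[int(np.argmin(scores))])    # approximately optimal lambda

# toy stand-in objective whose minimum lies near 0.3
lam_hat = select_lambda(lambda lam: (lam - 0.3) ** 2)
```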

Algorithm 6.14 Select \(\widehat {\lambda }(i)\) for FCLM

2 Label Proportion for Learning

2.1 A Fast Algorithm for Multi-Class Learning from Label Proportions

Learning from label proportions (LLP) is a relatively new kind of learning problem that has attracted wide interest in machine learning. Different from well-known supervised learning, the training data of LLP comes in the form of bags, and only the proportion of each class in each bag is available. In this subsection, we propose a fast algorithm called multi-class learning from label proportions by extreme learning machine (LLP-ELM), which exploits the fast learning speed of the extreme learning machine to solve multi-class learning from label proportions.

2.1.1 Background

In this section, we give a brief introduction to the traditional extreme learning machine (ELM) [21, 22]. Figure 6.5 shows the architecture of ELM. It is a single-hidden-layer feed-forward network with three parts: input neurons, hidden neurons and output neurons. In particular, h(x) = [h 1(x), …, h L(x)] is the nonlinear feature mapping of ELM, with the form h j(x) = g(w j ⋅x + b j), and β j = [β j1, …, β jc]T, j = 1, …, L are the output weights between the jth hidden node and the output nodes.

Fig. 6.5
figure 5

The architecture of ELM. It is a single-hidden-layer feed-forward network with three parts: input neurons, hidden neurons and output neurons. In particular, h(x) = [h 1(x), …, h L(x)] is the nonlinear feature mapping of ELM, with the form h j(x) = g(w j ⋅x + b j), and β j = [β j1, …, β jc]T, j = 1, …, L are the output weights between the jth hidden node and the output nodes

Given N samples (x i, t i), i = 1, …, N, where x i = [x i1, …, x id]T denotes the input feature vector and t i = [t i1, …, t ic]T is the corresponding label in one-hot form. In particular, c and d respectively represent the total number of classes and the number of features. Consequently, a standard feed-forward neural network with L hidden nodes can be expressed as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sum_{j=1}^L \boldsymbol{\beta_{j}} g(\mathbf{w_j.x_i} + b_j) = \mathbf{o_{i}}, i = 1,\ldots,N, \end{array} \end{aligned} $$
(6.54)

where w j = [w j1, w j2, …, w jd]T is the weight vector between the jth hidden neuron and the input neurons, and β j = [β j1, β j2, …, β jc]T, j = 1, …, L is the weight vector connecting the jth hidden neuron and the output neurons. According to [21], the ELM can approximate these N samples with zero error, in the sense that \(\sum _{i=1}^N \|\mathbf {o_i - t_i}\| = 0\). Thus, the above equations can be expressed as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sum_{j=1}^L \boldsymbol{\beta_j} g(\mathbf{w_j.x_i} + b_j) = \mathbf{t_{i}}, i = 1,\ldots,N. \end{array} \end{aligned} $$
(6.55)

In particular, we can express the above N equations in matrix form:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbf{H\boldsymbol{\beta} = T}, \end{array} \end{aligned} $$
(6.56)

where H is the hidden layer output matrix of the single-hidden-layer feed-forward network and T is the output matrix. More specifically, H and T have the form:

$$\displaystyle \begin{aligned} \mathbf{H}=\left[ \begin{array}{c} \mathbf{h}({\mathbf{x}}_{1})\\ \vdots \\ \mathbf{h}({\mathbf{x}}_{N})\\ \end{array} \right]= \left[ \begin{array}{ccc} g({\mathbf{w}}_{1}\cdot {\mathbf{x}}_{1}+b_{1}) &\cdots &g({\mathbf{w}}_{L}\cdot {\mathbf{x}}_{1}+b_{L})\\ \vdots &\ddots &\vdots \\ g({\mathbf{w}}_{1}\cdot {\mathbf{x}}_{N}+b_{1}) &\cdots &g({\mathbf{w}}_{L}\cdot {\mathbf{x}}_{N}+b_{L})\\ \end{array} \right]_{N\times L} \end{aligned} $$
(6.57)

and

$$\displaystyle \begin{aligned} \mathbf{T}=\left[ \begin{array}{c} {\mathbf{t}}_{1}^{T}\\ \vdots \\ {\mathbf{t}}_{N}^{T}\\ \end{array} \right]_{N\times c}. \end{aligned} $$
(6.58)

In practice, the hidden node parameters (w, b) of ELM are randomly generated and then fixed without iterative tuning, which is different from traditional BP neural networks [21]. As a result, training an ELM is equivalent to finding the optimal solution β*, which is defined as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \|\mathbf{H}\boldsymbol{\beta}^{*} - \mathbf{T}\| = \min_{\boldsymbol{\beta}}\|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\|. \end{array} \end{aligned} $$
(6.59)

Furthermore, β* can be computed by the following expression:

$$\displaystyle \begin{aligned} \boldsymbol{\beta}^* = {\mathbf{H}}^{\dag}\mathbf{T} \end{aligned} $$
(6.60)

where H † is the Moore-Penrose generalized inverse of matrix H.
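The whole ELM training procedure therefore boils down to one pseudoinverse. A minimal sketch (the sigmoid activation, Gaussian random hidden parameters, and toy data are illustrative assumptions, not requirements of ELM):

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, T, L=50):
    """Basic ELM fit: random fixed hidden layer, then beta = pinv(H) @ T (Eq. (6.60))."""
    d = X.shape[1]
    W, b = rng.normal(size=(d, L)), rng.normal(size=L)  # hidden parameters, never tuned
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))              # sigmoid feature map h(x)
    beta = np.linalg.pinv(H) @ T                        # Moore-Penrose solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# toy two-class problem with one-hot targets
X = rng.normal(size=(40, 3))
T = np.eye(2)[(X[:, 0] > 0).astype(int)]
W, b, beta = elm_fit(X, T)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
```

Because the hidden layer is fixed, the only learned quantity is the linear readout β, which is why ELM training is so fast compared with iterative back-propagation.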

2.1.2 The LLP-ELM Algorithm

In this section, we propose a fast algorithm for multi-class learning from label proportions, called LLP-ELM, which employs the extreme learning machine to solve the multi-class LLP problem. In order to leverage the extreme learning machine for LLP, we reshape the hidden layer output matrix H and the training data target matrix T into new forms, such that H is at the bag level and T contains proportion information instead of labels.

2.1.2.1 A. Learning Setting

The LLP problem is described by a set of training data divided into several bags. Furthermore, compared to traditional supervised learning, we only know the proportions of the different categories in each bag instead of the ground-truth labels. Here, we consider the situation where different bags are disjoint, and the nth bag of the training data is denoted as B n, n = 1, …, h. Consequently, the total training data has the form:

$$\displaystyle \begin{aligned} \begin{array}{rcl} D = B_1 \cup B_2 \cup \ldots \cup B_h, \\ B_i \cap B_j = \emptyset,\ \forall i \neq j,\end{array} \end{aligned} $$
(6.61)

where there are h bags and N is the total number of instances. Each bag consists of m n instances with the constraint \(\sum _{n=1}^{h} m_n = N\), and can be expressed as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} B_n = \{x_n^1, . . .,x_n^{m_n}\}, n \in \{1,2,\ldots,h\}. \end{array} \end{aligned} $$
(6.62)

Meanwhile, p n is the corresponding class proportion vector of B n and c represents the total number of classes. More specifically, p n can be written in vector form:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{p}}_n = [p_n^1, p_n^2, \ldots, p_n^c]^T, \end{array} \end{aligned} $$
(6.63)

where the mth element \(p_n^m\) is the proportion of the mth class in the nth bag, with the constraint \(\sum _{m=1}^c p_n^m = 1\). Furthermore, the total proportion information can be collected in matrix form:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbf{P} = [{\mathbf{p}}_1, {\mathbf{p}}_2, \ldots, {\mathbf{p}}_h]^T. \end{array} \end{aligned} $$
(6.64)
2.1.2.2 B. The LLP-ELM Framework

From the above learning setting of LLP, an instance-level classifier is the final objective. To this end, we modify the original ELM equations into new bag-level equations. Specifically, we directly sum all the equations in each bag, and the resulting equations for the nth bag can be expressed as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sum_{j=1}^L \sum_{k=1}^{m_n} \boldsymbol{\beta_j} g(\mathbf{w_j.x_{nk}} + b_j) = \sum_{k=1}^{m_n}\mathbf{t_{nk}}, n = 1,\ldots,h {} \end{array} \end{aligned} $$
(6.65)

where t nk is the real label of the kth instance in the nth bag. Obviously, the real label information on the right-hand side is inaccessible to us, as only the label proportions of each bag are available. To this end, we rewrite the right-hand side of the above equation in the following form:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sum_{k=1}^{m_n}\mathbf{t_{nk}} = m_n*\mathbf{p_n}, n = 1,\ldots,h {} \end{array} \end{aligned} $$
(6.66)

where p n is the label proportion vector of the nth bag. Substituting formula (6.66) into (6.65), we naturally obtain the following equations:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sum_{j=1}^L \boldsymbol{\beta_j} \sum_{k=1}^{m_n} g(\mathbf{w_j.x_{nk}} + b_j) = m_n*\mathbf{p_n}, n = 1,\ldots,h. \end{array} \end{aligned} $$
(6.67)

In particular, similar to the method from ELM [21], we can write the above equations in the form of matrix computing as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbf{H_p\boldsymbol{\beta} = P} {} \end{array} \end{aligned} $$
(6.68)

where H p is the hidden layer output matrix at the bag level, and P is the training data target proportion matrix. More specifically, following Eq. (6.67), H p and P have the form:

$$\displaystyle \begin{aligned} \mathbf{H_p}=\left[ \begin{array}{c} \sum_{k=1}^{m_1}\mathbf{h}({\mathbf{x}}_{1k})\\ \vdots \\ \sum_{k=1}^{m_h}\mathbf{h}({\mathbf{x}}_{hk})\\ \end{array} \right]_{h\times L} \end{aligned} $$

and

$$\displaystyle \begin{aligned} \mathbf{P}=\left[ \begin{array}{c} m_{1}{\mathbf{p}}_{1}^{T}\\ \vdots \\ m_{h}{\mathbf{p}}_{h}^{T}\\ \end{array} \right]_{h\times c}. \end{aligned} $$

Meanwhile, the final solution β has the same form as in the original ELM, with dimension L × c. Again, the optimal solution to (6.68) is given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} \boldsymbol{\beta}^* = {\mathbf{H}}_{\mathbf{p}}^{\dag}\mathbf{P} \end{array} \end{aligned} $$
(6.69)

where \(\mathbf {H_p^{\dag }}\) is the Moore-Penrose generalized inverse of matrix H p.

In order to obtain better generalization performance, we also follow the method from [22] and study the regularized ELM. In detail, the final objective function of ELM is formulated as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl} & &\displaystyle \min_{\boldsymbol{\beta} \in R^{L\times c}} \frac{1}{2} \|\boldsymbol{\beta}\|{}^2 + \frac{C}{2}\sum_{i=1}^N \|\mathbf{e_i}\|{}^2 \\ & &\displaystyle s.t. \ \ \mathbf{h(x_i)\boldsymbol{\beta} = t_i^T - e_i^T,} i = 1,\ldots,N, {} \end{array} \end{aligned} $$
(6.70)

in which the first term of the objective function is a regularization term and C is a parameter to make a trade-off between the first and second term.

We equivalently reformulate problem (6.70) as follows by substituting the constraints into its objective function:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \min_{\beta \in R^{L\times c}} L_{ELM} = \frac{1}{2}\|\boldsymbol{\beta}\|{}^2 + \frac{C}{2}\| \mathbf{T - H\boldsymbol{\beta}} \|{}^2 {} \end{array} \end{aligned} $$
(6.71)

Note that the second term of (6.71) can be replaced by \(\frac {C}{2}\| \mathbf {P - H_p\boldsymbol {\beta }} \|{ }^2\), which is the matrix form in bag level. In other words, the final unconstrained optimization problem can be written as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \min_{\beta \in R^{L\times c}} L_{ELM} = \frac{1}{2}\|\boldsymbol{\beta}\|{}^2 + \frac{C}{2}\| \mathbf{P - H_p}\boldsymbol{\beta} \|{}^2 {} \end{array} \end{aligned} $$
(6.72)

In practice, the final objective is widely known as ridge regression or regularized least squares.

2.1.2.3 C. How to Solve the LLP-ELM

We follow the strategy from [22] to solve (6.72); the purpose is to minimize the training error as well as the norm of the output weights. The final objective function is convex, so it can be solved by setting the gradient to zero. More specifically, setting the gradient of (6.72) with respect to β to zero, we obtain the following expression:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \boldsymbol{\beta} - C\mathbf{H_p^T(P - H_p\boldsymbol{\beta}) = 0}. {} \end{array} \end{aligned} $$
(6.73)

This yields

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbf{(\frac{I}{C} + H_p^TH_p)\boldsymbol{\beta} = H_p^TP}, \end{array} \end{aligned} $$
(6.74)

where I is an identity matrix with dimension L.

The above equation is very intuitive, and we can obtain the final optimization result by inverting an L×L matrix directly. However, this is inefficient when the number of bags is smaller than the number of hidden neurons (h < L). Therefore, there are two methods, shown in Remark 1 and Remark 2. In summary, when there are more bags than hidden neurons, we use Remark 1 to compute the output weights; otherwise we use Remark 2.

Remark 1

The solution for formula (6.73) when h >  L.

  • H p has more rows than columns, which means the number of bags is larger than the number of hidden neurons.

  • By inverting an L×L matrix directly, i.e., multiplying both sides by \(\mathbf {(H_p^TH_p + \frac {I}{C})^{-1}}\) , we can obtain the following expression

    $$\displaystyle \begin{aligned} \begin{array}{rcl} \boldsymbol{\beta} = \mathbf{(H_p^TH_p + \frac{I}{C})^{-1}H_p^TP}, \end{array} \end{aligned} $$
    (6.75)

    which is the optimal solution of (6.73).

Remark 2

The solution for formula (6.73) when h <  L.

  • Notice that H p is full row rank and \(\mathbf {H_pH_p^T}\) is invertible when h <  L.

  • Restrict β to be a linear combination of the rows of \(\mathbf {H_p}\): \(\boldsymbol {\beta } = {\mathbf {H}}_{\mathbf {p}}^{\mathbf {T}}\boldsymbol {\alpha }\).

  • Substitute \(\boldsymbol {\beta } = {\mathbf {H}}_{\mathbf {p}}^{\mathbf {T}}\boldsymbol {\alpha }\) into (6.73), and multiply by \(\mathbf {(H_pH_p^T)^{-1}H_p}\).

  • By the above step, we can obtain the following equation:

    $$\displaystyle \begin{aligned} \begin{array}{rcl} \boldsymbol{\alpha} - C\mathbf{(P - H_pH_p^T\boldsymbol{\alpha}) = 0}. \end{array} \end{aligned} $$
    (6.76)
  • As a result, the final optimal solution of (6.73) is in form of

    $$\displaystyle \begin{aligned} \begin{array}{rcl} \boldsymbol{\beta} = \mathbf{H_p^T\boldsymbol{\alpha}} = \mathbf{H_p^T(H_pH_p^T + \frac{I}{C})^{-1}P}. \end{array} \end{aligned} $$
    (6.77)

The solution process of the LLP-ELM model can be summarized in the following two steps:

  • Compute training data target proportion matrix P and the hidden layer output matrix H p.

  • Obtain the final optimal solution of β according to Remark 1 or Remark 2. The details of the process are shown in Algorithm 6.15.
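The two steps can be sketched end to end as follows. This is a minimal illustration, not the original implementation: the sigmoid hidden layer, Gaussian random weights, regularization constant and toy bags are all assumed choices; the branch on h versus L implements Remark 1 and Remark 2.

```python
import numpy as np

rng = np.random.default_rng(1)

def llp_elm(bags, props, L=30, C=10.0):
    """LLP-ELM sketch. `bags` is a list of (m_n, d) arrays and `props`
    an (h, c) array of per-bag label proportions p_n."""
    d, h = bags[0].shape[1], len(bags)
    W, b = rng.normal(size=(d, L)), rng.normal(size=L)   # fixed random hidden layer
    hidden = lambda X: 1.0 / (1.0 + np.exp(-(X @ W + b)))
    Hp = np.stack([hidden(B).sum(axis=0) for B in bags])     # h x L, bag-level sums
    P = np.stack([len(B) * p for B, p in zip(bags, props)])  # h x c, rows m_n * p_n
    if h >= L:   # Remark 1: invert an L x L matrix
        beta = np.linalg.solve(Hp.T @ Hp + np.eye(L) / C, Hp.T @ P)
    else:        # Remark 2: invert an h x h matrix instead
        beta = Hp.T @ np.linalg.solve(Hp @ Hp.T + np.eye(h) / C, P)
    return hidden, beta

# two toy bags drawn from two separated clusters, with known proportions
X0, X1 = rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))
bags = [np.vstack([X0[:15], X1[:5]]), np.vstack([X0[15:], X1[5:]])]
props = np.array([[0.75, 0.25], [0.25, 0.75]])
hidden, beta = llp_elm(bags, props)
pred = (hidden(np.vstack([X0, X1])) @ beta).argmax(axis=1)  # instance-level classifier
```

Note that although training only ever sees bag-level sums, the learned β is applied to individual instances at prediction time, which is exactly the point of LLP.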

Algorithm 6.15 LLP-ELM

2.1.2.4 D. Computational Complexity

From Remark 1 and Remark 2, we can observe that the main time cost of our method lies in the matrix inversion. Furthermore, the dimension of the inverted matrix is the minimum of the number of bags h and the number of hidden neurons L, the latter of which is chosen by us. As is well known, the complexity of matrix inversion is O(q 3), where q is the dimension of the matrix, equal to min(L, h) in this case.

2.2 Learning from Label Proportions with Generative Adversarial Networks

2.2.1 Preliminaries

2.2.1.1 A. The Multi-Class LLP

Before further discussion, we formally describe multi-class LLP. For simplicity, we assume that all the bags are disjoint and let \(\mathcal {B}_i=\{{\mathbf {x}}_i^1,{\mathbf {x}}_i^2,\cdots ,{\mathbf {x}}_i^{N_i}\}, i = 1,2,\cdots ,n\) denote the bags in the training set. Then, the training data is \(\mathcal {D}=\mathcal {B}_1 \cup \mathcal {B}_2 \cup \cdots \cup \mathcal {B}_n, \mathcal {B}_i \cap \mathcal {B}_j=\emptyset ,\forall i \neq j\), where the total number of bags is n.

In addition, p_i is a K-element vector whose kth element \(p_i^k\) is the proportion of instances in \(\mathcal {B}_i\) belonging to the kth class, with the constraint \(\sum _{k=1}^K p_i^k =1\), where K denotes the total number of classes, i.e.,

$$\displaystyle \begin{aligned} p_i^k := \frac{|\{ j\in [1:N_i]|{\mathbf{x}}_i^j\in \mathcal{B}_i, y_i^{j*} = k\}|}{|\mathcal{B}_i|}. \end{aligned} $$
(6.78)

Here, [1 : N_i] = {1, 2, ⋯ , N_i} and \(y_i^{j*}\) is the inaccessible ground-truth instance-level label of \({\mathbf {x}}_i^j\). In this way, we can denote the available training data as \(\mathcal {L}=\{(\mathcal {B}_i,{\mathbf {p}}_i)\}_{i=1}^n\). The goal of LLP is to learn an instance-level classifier based on this kind of dataset.
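For concreteness, the proportion vector in (6.78) can be computed as follows when the ground-truth labels are available (during LLP training they are not; this is only a sketch with hypothetical names):

```python
import numpy as np

def bag_proportions(labels, K):
    """Compute the K-element proportion vector p_i for one bag, as in
    (6.78), from the (normally inaccessible) instance-level labels.

    labels : iterable of class indices in {1, ..., K} for one bag B_i
    """
    labels = np.asarray(labels)
    # p_i^k = |{j : y_i^{j*} = k}| / |B_i|
    return np.array([(labels == k).sum() / len(labels) for k in range(1, K + 1)])
```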

2.2.1.2 B. Deep Discriminant Approach for LLP

In terms of deep learning, DLLP was the first to leverage CNNs to solve the multi-class LLP problem [1]. Since CNNs provide a probabilistic interpretation for classification, it is straightforward to adapt the cross-entropy loss into a bag-level version by averaging the probability outputs in every bag as the proportion estimate. To this end, inspired by [71], DLLP reshaped the standard cross-entropy loss by substituting label proportions for instance-level labels, in order to enforce proportion consistency.

In detail, suppose that \(\tilde {\mathbf {p}}_i^j=p_\theta (\mathbf {y}|{\mathbf {x}}_i^j)\) is the vector-valued CNN output for \({\mathbf {x}}_i^j\), where θ is the network parameter. Let ⊕ be the element-wise summation operator; then the bag-level label proportion of the ith bag is obtained by aggregating the element-wise posterior probabilities:

$$\displaystyle \begin{aligned} \overline{\mathbf{p}}_i = \frac{1}{N_i}\bigoplus_{j=1}^{N_i} \tilde{\mathbf{p}}_i^j=\frac{1}{N_i}\bigoplus_{j=1}^{N_i}p_\theta(\mathbf{y}|{\mathbf{x}}_i^j), \end{aligned} $$
(6.79)

As a smooth approximation of the max function [5], \(\tilde {\mathbf {p}}_i^j\) is produced in a vector-valued softmax manner, yielding the distribution over class probabilities. Taking log as the element-wise logarithmic operator, the objective of DLLP can be intuitively formulated using the cross-entropy loss \(L_{prop}=-\sum _{i=1}^{n} {\mathbf {p}}_i^\intercal log(\overline {\mathbf {p}}_i)\). It penalizes the difference between the prior and posterior probabilities at the bag level, and commonly appears in GAN-based SSL [61].

2.2.1.3 C. Entropy Regularization for DLLP

Following the entropy regularization strategy [18], we can introduce an extra loss E_in, with a trade-off hyperparameter λ, to constrain the instance-level output distribution to have low entropy:

$$\displaystyle \begin{aligned} L = L_{prop}+\lambda E_{in} = -\sum_{i =1}^{n} {\mathbf{p}}_i^\intercal log(\overline{\mathbf{p}}_i) -\lambda\sum_{i=1}^{n}\sum_{j=1}^{N_i} (\tilde{\mathbf{p}}_i^j)^\intercal log(\tilde{\mathbf{p}}_i^j). \end{aligned} $$
(6.80)

This extension is similar to a KL divergence between two distributions. It exploits the DNN’s output distribution to satisfy the label-proportion requirement, while minimizing the output entropy as a regularization term to guarantee a strong true-fake belief. This can be linked to an inherent MAP estimation with a certain prior distribution over the network parameters.
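A minimal numpy sketch of the combined objective (6.80), assuming each bag's softmax outputs are collected in an array (the function and argument names are ours, not from DLLP's code):

```python
import numpy as np

def dllp_loss(probs_per_bag, props, lam=0.1, eps=1e-12):
    """DLLP objective (6.80): bag-level cross-entropy L_prop plus an
    entropy regularizer E_in on the instance-level outputs (a sketch).

    probs_per_bag : list of (N_i, K) arrays of softmax outputs p~_i^j
    props         : list of (K,) ground-truth proportion vectors p_i
    lam           : trade-off hyperparameter lambda
    """
    L_prop, E_in = 0.0, 0.0
    for probs, p in zip(probs_per_bag, props):
        p_bar = probs.mean(axis=0)                   # (6.79): average over the bag
        L_prop -= p @ np.log(p_bar + eps)            # proportion-consistency term
        E_in -= np.sum(probs * np.log(probs + eps))  # instance-level entropy
    return L_prop + lam * E_in
```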

2.2.2 Adversarial Learning for LLP

In this section, we propose LLP-GAN, which adapts GANs to the LLP problem.

2.2.2.1 A. The Objective Function of Discriminator

We illustrate the LLP-GAN framework in Fig. 6.6. The generator is employed to generate images from input noise, which are labeled as fake. On the other hand, the discriminator yields class confidence maps for each class (including the fake one) by taking both fake and real data as inputs. In particular, our discriminator not only identifies whether a sample comes from the real data, but also distinguishes each real input’s label assignment as a K-class classifier. This idea is fairly intuitive, and we summarize its loss as the L_unsup term.

Fig. 6.6 An illustration of our LLP-GAN framework

Next, the main issue becomes how to exploit the proportional information to guide this unsupervised learning correctly. To this end, we replace the supervised information in semi-supervised GANs with label proportions, resulting in L sup, same as L prop in (6.80).

Definition 6.36

Suppose that \(\mathcal {P}\) is a partition dividing the data space into n disjoint sections. Let \(p_d^i(\mathbf {x}), i\!=\!1,2,\cdots ,n\) be the marginal distributions with respect to the elements of \(\mathcal {P}\), respectively. Accordingly, the n bags in the LLP training data arise from sampling from \(p_d^i(\mathbf {x}), i\!=\!1,2,\cdots ,n\). In the meantime, let p(x, y) be the unknown holistic joint distribution.

We normalize the first K classes in P D(⋅|x) into the instance-level posterior probability \(\tilde {p}_{D}(\cdot |\mathbf {x})\) and compute \(\overline {\mathbf {p}}\) based on (6.79). Then, the ideal optimization problem for the discriminator of LLP-GAN is:

$$\displaystyle \begin{aligned} \max_{D} \ \ & V(G,D) = L_{unsup} + L_{sup} = L_{real} + L_{fake}\! - \lambda CE_{\mathcal{L}}(\mathbf{p},\overline{\mathbf{p}})\\ & = \sum_{i=1}^n E_{\mathbf{x} \sim p_d^i}\Big [logP_D(y\leq K|\mathbf{x})\Big] + E_{\mathbf{x} \sim p_g}\Big [logP_D(K+1|\mathbf{x})\Big]+\lambda \sum_{i=1}^n{\mathbf{p}}_i^\intercal log(\overline{\mathbf{p}}_i).\\ \end{aligned} $$
(6.81)

Here, p g(x) represents the distribution of the synthesized data.

The normalized instance-level posterior probability \(\tilde {p}_{D}(\cdot |\mathbf {x})\) is:

$$\displaystyle \begin{aligned} \tilde{p}_{D}(k|\mathbf{x})=\frac{P_D(k|\mathbf{x})}{1-P_D(K+1|\mathbf{x})}, k=1,2,\cdots,K. \end{aligned} $$
(6.82)
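The renormalization in (6.82) is a one-liner; a sketch, assuming the discriminator outputs a (K+1)-vector whose last entry is the fake class:

```python
import numpy as np

def normalize_posterior(P_D):
    """Renormalize the first K of K+1 class probabilities as in (6.82).

    P_D : (K+1,) discriminator output; the last entry is the fake class K+1.
    """
    P_D = np.asarray(P_D, dtype=float)
    return P_D[:-1] / (1.0 - P_D[-1])
```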

Note that the weight λ in (6.81) is added to balance the supervised and unsupervised terms, a slight revision of SSL with GANs [13, 54]. Intuitively, we expect the proportional information to be too weak to fully drive supervised learning; as a result, a relatively small weight should be preferable in the experiments. However, we fix \(\lambda \!=\!1\) in the following theoretical analysis of the discriminator.

Aside from identifying the first two terms in (6.81) with those in semi-supervised GANs, the cross-entropy term enforces label-proportion consistency. In order to justify the non-triviality of this loss, we first look at its lower bound. More importantly, it is easier to apply the gradient method to the lower bound, because it swaps the order of the log and the summation. For brevity, the analysis is done in a non-parametric setting, i.e., we assume that both D and G have infinite capacity.

Remark (The Lower Bound Approximation)

Let \(p_i(k)\!=\!p_i^k\!=\!\int \! p_i(y\!=\!k|\mathbf {x})p_d^i(\mathbf {x})\mbox{d}\mathbf {x}\) be the class k proportion in the ith bag. By applying Monte-Carlo sampling, we have:

$$\displaystyle \begin{aligned} -CE_{\mathcal{L}}(\mathbf{p},\overline{\mathbf{p}}) & = \sum_{i=1}^n\!\sum_{k=1}^K\!p_i(k)log \Big [\frac{1}{N_i} \!\sum_{j=1}^{N_i}\!\tilde{p}_{D}(k|{\mathbf{x}}_i^j)\Big ]\\ &\backsimeq \! \sum_{i=1}^n\!\sum_{k=1}^K\!p_i(k)log \Big [\!\int\! p_d^i(\mathbf{x}) \tilde{p}_{D}(k|\mathbf{x}) \mbox{d}\mathbf{x}\Big ]\!\geqslant\! \sum_{i=1}^n\!\sum_{k=1}^K\!p_i(k)E_{\mathbf{x}\sim p_d^i}\!\Big [log \tilde{p}_{D}(k|\mathbf{x})\!\Big ].\\ \end{aligned} $$
(6.83)

Similar to the EM mechanism for mixture models, by approximating \(-CE_{\mathcal {L}}(\mathbf {p},\overline {\mathbf {p}})\) with its lower bound, we can perform gradient ascent independently on every sample. Hence, SGD can be applied.

Property 6.7

The maximization of the lower bound in (6.83) induces an optimal discriminator D^∗ whose posterior distribution \(\tilde {p}_{D^*}(y|\mathbf {x})\) is consistent with the prior distribution p_i(y) in each bag.

Proof

Taking the aggregation with respect to one bag, for example, the ith bag, we have:

$$\displaystyle \begin{aligned} E_{\mathbf{x}\sim p_d^i}[logp(\mathbf{x})]\!&=\!E_{\mathbf{x}\sim p_d^i}\!log\!\Big[\frac{p(\mathbf{x},y)}{\tilde{p}_{D}(y|\mathbf{x})}\frac{\tilde{p}_{D}(y|\mathbf{x})}{p(y|\mathbf{x})}\!\Big]\! \\ &=\!E_{\mathbf{x}\sim p_d^i}\!\int\!p_i(y) log\!\Big[\frac{p_i(y)p(\mathbf{x}|y)}{\tilde{p}_{D}(y|\mathbf{x})}\frac{\tilde{p}_{D}(y|\mathbf{x})}{p(y|\mathbf{x})}\!\Big]\mbox{d}y\\ &=\!E_{\mathbf{x}\sim p_d^i}\!\int\! p_i(y)\Big[log\tilde{p}_{D}(y|\mathbf{x})\!+\!log\frac{p(\mathbf{x}|y)}{p(y|\mathbf{x})}\Big]\mbox{d}y\! \\ &+\!E_{\mathbf{x}\sim p_d^i}KL(p_i(y)\|\tilde{p}_{D}(y|\mathbf{x}))\\ &\geqslant\!\sum_{k=1}^K\!p_i(k)E_{\mathbf{x}\sim p_d^i}\!\Big [log \tilde{p}_{D}(k|\mathbf{x})\!\Big ]+\!\sum_{k=1}^K\!p_i(k)E_{\mathbf{x}\sim p_d^i}\!\Big [log\frac{p(\mathbf{x}|k)}{p(k|\mathbf{x})}\!\Big ] \end{aligned} $$
(6.84)

Note that the last term in (6.84) is free of the discriminator, and the aggregation can be independently performed within every bag due to the disjoint assumption. Then, maximizing the lower bound in (6.83) is equivalent to minimizing the expectation of KL-divergence between p i(y) and \(\tilde {p}_{D}(y|\mathbf {x})\). Because of the infinite capacity assumption on discriminator and the non-negativity of KL-divergence, we have:

$$\displaystyle \begin{aligned} D^*=\operatorname*{\arg\min}_{D}E_{\mathbf{x}\sim p_d^i}KL(p_i(y)\|\tilde{p}_{D}(y|\mathbf{x}))\Leftrightarrow\tilde{p}_{D^*}(y|\mathbf{x})\overset{a.e.}{=}p_i(y), \mathbf{x}\sim p_d^i(\mathbf{x}). \end{aligned} $$
(6.85)

That concludes the proof. □

Property 6.7 tells us that if there is only one bag, then \(\tilde {p}_{D^*}(y|\mathbf {x})\overset {a.e.}{=}p(y)\). However, since there is normally more than one bag in LLP, the final classifier will be a trade-off among all the prior proportions p_i(y), i = 1, 2, ⋯ , n. Next, we show how adversarial learning on the discriminator turns this trade-off into a weighted aggregation.

2.2.2.2 B. Global Optimality

As shown in (6.83), in order to facilitate the gradient computation, we substitute the cross entropy in (6.81) with its lower bound and denote the resulting approximate objective function for the discriminator by \(\widetilde {V}(G,D)\).

Theorem 6.6

For fixed G, the optimal discriminator D for \(\widetilde {V}(G,D)\) satisfies:

$$\displaystyle \begin{aligned} P_{D^*}(y=k|\mathbf{x})=\frac{\sum_{i=1}^n p_i(k)p_d^i(\mathbf{x})}{ \sum_{i=1}^n p_d^i(\mathbf{x})+p_g(\mathbf{x})}, k=1,2,\cdots,K. \end{aligned} $$
(6.86)

Proof

According to (6.81) and (6.83) and given any generator G, we have:

$$\displaystyle \begin{aligned} &\widetilde{V}(G,D)=\sum_{i=1}^n E_{\mathbf{x}\sim p_d^i}\Big [log (1-P_D(K+1|\mathbf{x}))\Big]+E_{\mathbf{x}\sim p_g}\Big[log P_D(K+1|\mathbf{x})\Big]+\\ & \sum_{i=1}^n\sum_{k=1}^Kp_i(k)E_{\mathbf{x}\sim p_d^i}\Big [log \tilde{p}_{D}(k|\mathbf{x})\Big ] =\int\Big\{\sum_{i=1}^n p_d^i(\mathbf{x})\Big[log\big[\sum_{k=1}^KP_D(k|\mathbf{x})\big]+\\ & \sum_{k=1}^Kp_i(k)log \frac{P_D(k|\mathbf{x})}{1-P_D(K+1|\mathbf{x})}\Big]+ p_g(\mathbf{x})log\Big[1-\sum_{k=1}^KP_D(k|\mathbf{x})\Big]\Big\}\mbox{d}\mathbf{x}\\ \end{aligned} $$
(6.87)

By taking the derivative of the integrand, we find the maximum in [0, 1] as that in (6.86). □

Remark (Beyond the Incontinuity of p g)

According to [2], the problematic scenario is that the generator is a mapping from a low-dimensional space to a high-dimensional one, which makes the density of p_g(x) infeasible. However, based on the definition of \(\tilde {p}_{D}(y|\mathbf {x})\) in (6.82), we have:

$$\displaystyle \begin{aligned} \tilde{p}_{D^*}(y|\mathbf{x})\!=\!\frac{\sum_{i=1}^n p_i(y)p_d^i(\mathbf{x})}{ \sum_{i=1}^n p_d^i(\mathbf{x})}\!=\!\sum_{i=1}^n w_i(\mathbf{x})p_i(y). \end{aligned} $$
(6.88)

Hence, our final classifier does not depend on p g(x), and (6.88) explicitly expresses the weights of the aggregation.

Remark (Relationship to One-Sided Label Smoothing)

Notice that the optimal discriminator D^∗ is also related to the one-sided label smoothing mentioned in [54], which was inspired by [64] and shown to reduce the vulnerability of neural networks to adversarial examples [73].

In our model, we only smooth labels of real data (multi-class classifier) in the discriminator by setting the targets as the holistic proportions (the prior) p i(y) in corresponding bags.

2.2.2.3 C. The Objective Function of Generator

Normally, for the generator, we should solve the following optimization problem with respect to p_g:

$$\displaystyle \begin{aligned} \min_{G}\widetilde{V}(G,D^*)=\min_{G} E_{\mathbf{x} \sim p_g}logP_{D^*}(K+1|\mathbf{x}). \end{aligned} $$
(6.89)

If we denote \(C(G)=\max _D \widetilde {V}(G,D)=\widetilde {V}(G,D^*)\), then because \(\widetilde {V}(G,D)\) is convex in p_g and the supremum of a set of convex functions is still convex, we have the following conclusion.

Theorem 6.7

The global minimum of C(G) is achieved if and only if \(p_g=\frac {1}{n}\sum _{i=1}^n p_d^i\).

Proof

Denote \(p_d=\sum _{i=1}^n p_d^i\). Hence, according to Theorem 6.6, we can reformulate C(G) as:

(6.90)

where JSD(⋅∥⋅) and CE(⋅, ⋅) are the Jensen-Shannon divergence and cross entropy between two distributions, respectively. However, note that p d is a summation of n independent distributions, so \(\frac {1}{n}p_d\) is a well-defined probability density. Then, we have:

(6.91)

That concludes the proof. □

Remark

When there is only one bag, the first two terms in (6.91) degenerate to nlog(n) − (n + 1)log(n + 1) = −2log2 (for n = 1), which adheres to the result in the original GANs. On the other hand, the third term manifests the uncertainty about instance labels, due to their concealment in the form of proportions.

Remark

According to the analysis above, ideally, we can obtain the Nash equilibrium between the discriminator and the generator, i.e., the solution pair (G^∗, D^∗) satisfies:

(6.92)

However, as shown in [13], a well-trained generator would render the supervised information ineffective. In other words, the discriminator would possess the same generalization ability as one merely trained on L prop. Hence, we apply feature matching (FM) to the generator, obtaining its alternative objective by matching the expected value of the features (statistics) on an intermediate layer of the discriminator [54]: \(L(G) = \|E_{\mathbf {x} \sim \frac {1}{n}p_d}f(\mathbf {x}) - E_{\mathbf {x} \sim p_g }f(\mathbf {x})\|{ }_2^2\). In fact, FM is similar to the perceptual loss for style transfer in a concurrent work [26], and the goal of this improvement is to prevent a “perfect” generator, which results in unstable training and a discriminator with low generalization.

2.2.2.4 D. LLP-GAN Algorithm

So far, we have clarified the objective functions of both the discriminator and the generator in LLP-GAN. In particular, note that we execute Monte-Carlo sampling for the expectations. When the training stage is accomplished in the GAN manner, the discriminator can serve as the final classifier.

The strict proof of algorithm convergence is similar to that in [17]. Because \(\max _D \widetilde {V}(G,D)\) is convex in G and the subdifferential of \(\max _D \widetilde {V}(G,D)\) contains that of \(\widetilde {V}(G,D^*)\) in every step, gradient descent with exact line search converges [7]. We present the LLP-GAN algorithm as follows.

Algorithm 6.16 LLP-GAN training algorithm

2.3 Learning from Label Proportions on High-Dimensional Data

2.3.1 Background

In this subsection, we present the random forests method used for classification.

Random forests are an ensemble learning method together with a bagging procedure for classification and other tasks, where each basic classifier is a decision tree and each tree depends on a collection of random variables. More specifically, during splitting of a randomized tree, each decision node randomly selects a set of features and then picks the best among them according to some quality measurement (e.g., information gain or Gini index) [53]. Furthermore, as each tree in the forest is built and tested independently from other trees, the overall training and testing procedures can be performed in parallel [31].

We denote the mth tree of the random forest as f(x, θ_m), where θ_m is a random vector representing the various stochastic elements of the tree. Meanwhile, let p_m(k|x) represent the estimated density of class labels for the mth tree and M be the total number of trees in the forest. In practice, the final prediction of a random forest is given as probabilities over the different classes. As a result, the estimated probability of predicting class k can be defined as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} F_k(\mathbf{x}) = \frac{1}{M}\sum_{m=1}^M p_m(k|\mathbf{x}), k \in \gamma = \{1,2,\ldots,K\}, \end{array} \end{aligned} $$
(6.93)

where K is the total number of classes. In particular, a decision can be made by simply taking the class k with the maximum averaged probability over all trees,

$$\displaystyle \begin{aligned} \begin{array}{rcl} C(\mathbf{x}) = \arg \max_{k\in \gamma} F_k(\mathbf{x}), \gamma = \{1,2,\ldots,K\} \end{array} \end{aligned} $$
(6.94)

where the final result of C(x) is the index of the corresponding class.
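Equations (6.93) and (6.94) can be sketched with scikit-learn, whose `RandomForestClassifier.predict_proba` performs exactly this averaging over trees; the dataset here is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a forest on synthetic data, then average the per-tree class
# densities as in (6.93) and take the argmax as in (6.94).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# F_k(x) = (1/M) * sum_m p_m(k | x), one row per sample
F = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# C(x) = argmax_k F_k(x): predicted class for each sample
C = forest.classes_[np.argmax(F, axis=1)]
```

The manual average `F` coincides with `forest.predict_proba(X)`, which is how scikit-learn implements (6.93) internally.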

The classification margin, introduced by Breiman [8], measures the extent to which the average number of votes for the right class exceeds the average for any other class, and is expressed as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} mg(\mathbf{x},y) = F_y(\mathbf{x}) - \max\limits_{k\neq y}F_k(\mathbf{x}). \end{array} \end{aligned} $$
(6.95)

Obviously, if the classification is correct, then mg(x, y) > 0. In other words, the larger the margin, the more confident the classification. The generalization error of random forests is of the form:

$$\displaystyle \begin{aligned} \begin{array}{rcl} GE = E_{(X,Y)}(mg(\mathbf{x},y) < 0), \end{array} \end{aligned} $$
(6.96)

where the expectation is measured over the entire distribution of (X,Y).

Random forests have shown their advantages in both classification [8] and clustering [45]. In particular, experiments have shown that random forests can achieve high accuracy when classifying high-dimensional data [3]. Meanwhile, Caruana [9] presented an empirical evaluation of different methods on high-dimensional data, and found that random forests perform consistently well across all dimensions compared with other methods. Additionally, random forests are easy to parallelize, which makes them well suited to multi-core and GPU implementations. Sharp [56] showed that GPUs can accelerate random forests with a great advantage over CPUs in processing speed, which is very useful for practical applications. Recently, random forests have been applied in video segmentation [49], object detection [15], image classification [6] and remote sensing [46] due to these advantages.

2.3.2 The LLP-RF Algorithm

In this subsection, we present a novel learning from label proportions algorithm called LLP-RF, which uses random forests to solve the high-dimensional LLP problem. In order to adapt random forests to LLP, the hidden class labels inside bags are defined as the optimization variables. Meanwhile, we formulate a robust loss function based on random forests and incorporate the proportion information into LLP-RF by penalizing the difference between the ground-truth and estimated label proportions. A binary learning setting is considered in the following.

2.3.2.1 A. Learning Setting

Similar to standard supervised learning, the problem is described by a set of training data. However, the training data of LLP is only provided in the form of bags, and the ground-truth labels of the training instances are not available. Here, we assume the bags are disjoint. Let B_i, i = 1, …, n denote the ith bag in the training set. As a result, the total training data can be expressed as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} D = B_1 \cup B_2 \cup \ldots \cup B_n, \\ B_i \cap B_j = \emptyset,\forall i \neq j,\end{array} \end{aligned} $$
(6.97)

where the total number of training data is N. The ith bag consists of m i instances and is in form of:

$$\displaystyle \begin{aligned} \begin{array}{rcl} B_i = \{x_i^1, \ldots, x_i^{m_i}\}_{p_i}, i \in \{1,2,\ldots,n\}, \end{array} \end{aligned} $$
(6.98)

where the associated p i indicates the label proportion of the ith bag. As a result, the jth instance in the ith bag can be expressed as \(x_i^j\).

The ground-truth labels of the instances are modeled as \(\mathbf{y} = (y_1, \ldots, y_N)^T\), where y_i is the unknown label of x_i. Furthermore, we can define the proportion of the ith bag as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} p_i = \frac{|\{k|k\in B_i,y_k^* =1\}|}{|B_i|} , \forall k \in \{1,2,\ldots,N\}, \end{array} \end{aligned} $$
(6.99)

in which \(y_k^* \in \{1,-1\} \) is the unknown ground-truth label of x_k and |B_i| denotes the size of the ith bag. In practice, the above formulation is equivalent to the following:

$$\displaystyle \begin{aligned} \begin{array}{rcl} p_i = \frac{\sum_{k \in B_i}y_k^*}{2|B_i|} + \frac{1}{2} , \forall k \in \{1,2,\ldots,N\}. \end{array} \end{aligned} $$
(6.100)
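The equivalence of (6.99) and (6.100) is easy to check numerically; a small sketch with hypothetical helper names:

```python
import numpy as np

def proportion_direct(y):
    """p_i as in (6.99): fraction of +1 labels in the bag."""
    y = np.asarray(y)
    return np.sum(y == 1) / len(y)

def proportion_from_sum(y):
    """Equivalent form (6.100) using the signed label sum: since
    sum(y) = #pos - #neg = 2*#pos - |B_i|, dividing by 2|B_i| and
    adding 1/2 recovers #pos / |B_i|."""
    y = np.asarray(y)
    return np.sum(y) / (2 * len(y)) + 0.5
```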
2.3.2.2 B. The LLP-RF Framework

The above LLP learning setting is very intuitive and the final objective is to train a classifier in the instance level. To this end, inspired by [32], we formulate a robust loss function based on random forests and take the corresponding proportion information into LLP-RF by penalizing the difference between the ground-truth and estimated label proportion. Therefore, the final objective function of LLP-RF is formulated as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \arg\min_{F(\cdot),y_i^j} & &\displaystyle C\sum_{i=1}^n\sum_{j=1}^{m_i}L[F_{y_i^j}(x_i^j)] + C_p\sum_{i=1}^nL_p[p_i(\mathbf{y}),p_i]\\ s.t. & &\displaystyle \forall_{i=1}^n,\forall_{j=1}^{m_i} \quad y_i^j\in\{1,-1\},{} \end{array} \end{aligned} $$
(6.101)

where the hidden class labels y are defined as the optimization variables and the task is to simultaneously optimize the labels y and the model F(⋅).

Specifically, L(⋅) is a loss function defined over the entire set of instances and L_p(⋅) is a loss function used to penalize the difference between the ground-truth label proportion and the label proportion estimated from y. Different weights can be placed on the bag-proportion loss by changing the value of C_p.

Note that F_k(x) is the confidence of the classifier for the kth class, obtained from the random forests.

Furthermore, our proposed framework permits choosing different loss functions for L(⋅). In this work, different loss functions, including the hinge loss, logistic loss and entropy loss, are tuned to obtain better classification results. We take L_p(⋅) to be the absolute loss:

$$\displaystyle \begin{aligned} \begin{array}{rcl} L_p[p_i(\mathbf{y}),p_i] = |p_i(\mathbf{y}) - p_i|, \end{array} \end{aligned} $$
(6.102)

where p_i is the true label proportion of the ith bag and p_i(y) is its estimated label proportion.

The above LLP-RF framework is fairly straightforward and intuitive. However, it leads to a non-convex integer programming problem, because it needs to simultaneously optimize the labels \(y_i^j\) and train a random forest. In practice, the problem is often NP-hard. Therefore, one key issue is how to solve the optimization problem efficiently. Here, a simple but efficient alternating optimization strategy based on annealing is employed to minimize the overall learning objective.

2.3.2.3 C. How to Solve the LLP-RF

The strategy for solving (6.101) is similar to the rule from [80]. There are two variables F and y in the optimization formula, where the unknown instance labels y can be seen as a bridge between the supervised learning loss and the label proportion loss. Therefore, we solve the problem by alternately optimizing the two variables F and y.

  • First, we fix y. The optimization problem becomes a standard random forests problem, which can be expressed as below:

    $$\displaystyle \begin{aligned} \begin{array}{rcl} \arg\min_{F(\cdot)}C\sum_{i=1}^n\sum_{j=1}^{m_i}L[F(x_i^j)]. \end{array} \end{aligned} $$
    (6.103)
  • Then, F is fixed. The problem can be transformed to the following:

    $$\displaystyle \begin{aligned} \begin{array}{rcl} \arg\min_{y_i^j} & &\displaystyle C\sum_{i=1}^n\sum_{j=1}^{m_i}L[F_{y_i^j}(x_i^j)] + C_p\sum_{i=1}^nL_p[p_i(\mathbf{y}),p_i] \\ s.t.& &\displaystyle \forall_{i=1}^n,\forall_{j=1}^{m_i} \quad y_i^j\in\{1,-1\}. {} \end{array} \end{aligned} $$
    (6.104)

The first term of the objective is defined over the entire set of instances. However, the proportion information p_i in the second term is provided at the bag level. In order to use the proportion information efficiently, the above formula can be rewritten as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \arg\min_{y_i^j} & &\displaystyle \sum_{i=1}^n \Big\{C\sum_{j=1}^{m_i}L[F_{y_i^j}(x_i^j)] + C_pL_p[p_i(\mathbf{y}),p_i] \Big\} \\ s.t.& &\displaystyle \forall_{i=1}^n,\forall_{j=1}^{m_i} \quad y_i^j\in\{1,-1\}. {} \end{array} \end{aligned} $$
(6.105)

As the bags are disjoint, the contribution of each bag to the objective is independent. As a result, the objective can be optimized on each bag separately, and the final result is equivalent to the summation over all bags. In particular, solving for \(\{y_i^j|j\in B_i\}\) yields the following optimization problem:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \arg\min_{\{y_i^j|j\in B_i\}}& &\displaystyle C\sum_{j\in B_i}\ell[F_{y_i^j}(x_i^j)] + C_pL_p[p_i(\mathbf{y}),p_i] \\ s.t.& &\displaystyle \forall j\in B_i,\quad y_i^j\in {\{1,-1\}}. {} \end{array} \end{aligned} $$
(6.106)

Obviously, the original optimization problem has been reduced to solving formula (6.106), whose solution can be found by the following optimization strategy.

Remark

The steps for solving formula (6.106).

  • Compute all possible values of the second term in formula (6.106), of which there are |B_i| + 1 in total. In practice, the kth value can be expressed as:

    $$\displaystyle \begin{aligned} \begin{array}{rcl} F_2(k) = |\frac{k-1}{|B_i|} - p_i|, k \in \{1,2, \ldots ,|B_i|,|B_i| +1\}. \end{array} \end{aligned} $$
    (6.107)
  • Obtain all the values of the first term F_1(k) corresponding to the second term F_2(k).

  • Pick the smallest objective value from

    $$\displaystyle \begin{aligned} \begin{array}{rcl} C*F_1(k) + C_p*F_2(k), k \in \{1,2,\ldots,|B_i|,|B_i| +1\}, \end{array} \end{aligned} $$
    (6.108)

    yielding the optimal solution of (6.106).

The above strategy is fairly intuitive and straightforward. The main question is how to obtain the value of the first term corresponding to each value of the second term. In practice, there are |B_i| + 1 values of the second term in total. For a fixed value of the second term, the steps of Proposition 6.4 can be taken.

Proposition 6.4

For a fixed p i(y) = θ, we can find the solution of (6.106) by the iterative steps as below.

  • Initialize \(y_i^j = -1, \forall j\in \{1,2,\ldots ,|B_i|\}\), where |B_i| is the number of instances in the ith bag.

  • Compute the value of \(\ell [F_{-1}(x_i^j)], j \in \{1,2,\ldots ,|B_i|\}\).

  • Flip the signs so that \(y_i^j = 1, \forall j\in \{1,2,\ldots ,|B_i|\}\).

  • Compute the value of \(\ell [F_1(x_i^j)], j \in \{1,2,\ldots ,|B_i|\}\).

  • Let \(\delta _i^j = C(\ell [F_1(x_i^j)] - \ell [F_{-1}(x_i^j)]), j \in \{1,2,\ldots ,|B_i|\}\) denote the reduction of the first term in (6.106) through flipping the sign of \(y_i^j\).

  • Sort \(\delta _i^j, \forall j\in \{1,2,\ldots ,|B_i|\}\) in descending order. Then flip the signs of the \(y_i^j\) of the top-R (R = θ|B_i|) with the highest reduction. For each bag, we only need to sort the \(\delta _i^j, \forall j\in \{1,2,\ldots ,|B_i|\}\) once.
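The enumeration in the Remark combined with the flipping steps of Proposition 6.4 can be sketched per bag as follows, assuming the instance losses ℓ[F_{+1}(x)] and ℓ[F_{−1}(x)] have been precomputed from the fixed forest (all names here are ours, not from the LLP-RF paper):

```python
import numpy as np

def solve_bag_labels(loss_pos, loss_neg, p_i, C=1.0, Cp=1.0):
    """Assign the labels inside one bag by enumerating (6.106)-(6.108).

    loss_pos, loss_neg : per-instance losses l[F_{+1}(x)] and l[F_{-1}(x)]
    p_i                : ground-truth label proportion of the bag
    """
    loss_pos, loss_neg = np.asarray(loss_pos), np.asarray(loss_neg)
    B = len(loss_pos)
    # delta_j: change in the first term when flipping y_j from -1 to +1;
    # sorting ascending puts the most beneficial flips first
    delta = C * (loss_pos - loss_neg)
    order = np.argsort(delta)
    best_val, best_y = np.inf, None
    for R in range(B + 1):                      # |B_i| + 1 candidate positive counts
        y = -np.ones(B, dtype=int)
        y[order[:R]] = 1                        # flip the R cheapest labels to +1
        F1 = C * np.where(y == 1, loss_pos, loss_neg).sum()
        F2 = Cp * abs(R / B - p_i)              # proportion penalty, as in (6.107)
        if F1 + F2 < best_val:
            best_val, best_y = F1 + F2, y
    return best_y, best_val
```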

Obviously, the minimum value for each bag and the corresponding y can be obtained using the above steps. In detail, the solution process of the LLP-RF model can be summarized as alternating between two steps: solving the random forests optimization problem and updating the labels y, until the objective value no longer changes or the reduction of the objective is smaller than a threshold. The details of the process are shown in Algorithm 6.17.

Furthermore, in order to avoid local solutions, similar to T-SVM [10] and SVM [80], the proposed LLP-RF algorithm also uses an additional annealing loop to gradually increase C. The annealing can be seen as a step to avoid local optima. In detail, the annealing loop is realized by the equation \(C^* = \min \{(1+\triangle )C^*,C\},\) where △ is a step size controlling the increase of C. Throughout this work, we set △ = 0.5.

In practice, different initializations of y can lead to different results. In order to reduce this randomness, we repeat the process several times and pick the smallest objective value as the final result.

Algorithm 6.17 LLP-RF

2.4 Learning from Label Proportions with Pinball Loss

2.4.1 Preliminary

In this subsection, we introduce the basic formulation of learning from label proportions and give the corresponding notation.

In learning from label proportions, although the proportion of each bag is given, the label of each instance is unknown. Suppose we are given a sample set \(\{x_{i}, y_{i}^{*}\}_{i=1}^{N}\), where \(x_i\in \mathcal {R}^{n}\) and \(y_{i}^{*} \in \{-1, 1\}\) denotes the unknown ground-truth label of x_i. The sample set is grouped into K bags. In this subsection, we assume that the bags are disjoint.

The ground truth label proportion of the k-th bag S k can be defined as

$$\displaystyle \begin{aligned} P_{k}:=\frac{|\{i|i\in S_{k}, y_{i}^{*}=1 \}|}{|S_{k}|}. \end{aligned} $$

The goal is to find a decision function f(x) = sign(w T ϕ(x) + b) such that the label y for any instance x can be predicted, where ϕ(⋅) is a map of the input data.

Assume the instance labels are explicitly modeled as \(\{y_{i}\}_{i=1}^{N}\), where y_i ∈{−1, 1}. The modeled label proportion of the kth bag can then be defined as

$$\displaystyle \begin{aligned} p_{k}(y)=\frac{|\{i|i\in S_{k}, y_{i}=1 \}|}{|S_{k}|}. \end{aligned} $$

The learning from label proportions model can be formulated as below:

$$\displaystyle \begin{aligned} & \min_{y,w,b} \frac{1}{2} \|w\|{}^2 + C \sum_{i=1}^N L_{\tau}(1-y_{i}(w^{T}\phi (x_{i})+b))+C_{2} \sum_{k=1}^K |p_{k}(y)-P_{k}|, \\ & \mbox{s.t.}\ y_{i} \in \{-1,1\}, {} \end{aligned} $$
(6.109)

in which L_τ(⋅) is the supervised loss function. Notice that the instance labels y are also variables, which can be seen as a bridge between the empirical loss and the label proportion loss.

Here, we first discuss the noise generated in the framework of learning from label proportions, and introduce pinball loss to address this issue. Next, we give the learning from label proportions model with pinball loss. Also, the dual problem is given. Then, an alternating optimization method is applied to solve the proposed model. Finally, the complexity of our method is discussed.

2.4.2 Noise and Pinball Loss

Unlike the traditional hinge loss, the pinball loss pushes the surfaces that define the margin toward quantile positions by also penalizing correctly classified points [24]. With the hinge loss, the distance between the two classes is measured by the nearest points and is thus easily affected by noise on the features x; improper initialization of the labels y causes noise as well. As a result, a classifier with the hinge loss is sensitive to feature noise. The pinball loss is related to quantiles and has been well studied in regression (parametric methods [52] and nonparametric methods [12, 63]). It has also been used for binary classification recently [23].

The pinball loss is defined as follows:

$$\displaystyle \begin{aligned} &L_{\tau}(u)\ \ = \begin{cases} u, &\!\!\! u\geq 0,\\ -\tau u, &\!\!\! u< 0. \end{cases} \end{aligned} $$
(6.110)

In particular, when τ = 0, the pinball loss L_τ(u) reduces to the hinge loss. When a positive τ is used, minimizing the pinball loss yields a quantile value.
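The piecewise definition (6.110) is a one-liner; a sketch:

```python
import numpy as np

def pinball_loss(u, tau):
    """Pinball loss (6.110): identity for u >= 0, slope -tau for u < 0.
    With tau = 0 it reduces to the hinge-style loss max(u, 0)."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0, u, -tau * u)
```

In the classification objective above, the loss is applied to the margin residual u = 1 − y(w^T φ(x) + b).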

To show the properties of the pinball loss intuitively, we compare classifiers based on the hinge loss and the pinball loss. Consider a two-dimensional example: points are generated from two Gaussian distributions N(μ 1, Σ) and N(μ 2, Σ), where μ 1 = [0.5, −3]T, μ 2 = [0.5, 3]T and Σ = [0.1, 0;0, 2]. As shown in Fig. 6.7, the solid lines indicate the classification hyperplanes obtained by the hinge loss classifier and the dashed lines represent the hyperplanes obtained by the pinball loss. Although the data points are generated from the same distribution, the hinge loss classifier produces significantly different results, while the pinball loss hyperplane is more stable. This is mainly because the hinge loss classifier measures the distance between the two sets by the nearest points, whereas the pinball loss uses the nearest τ fraction (e.g. 35%) of points to measure this distance, which makes its result less sensitive to noise around the boundary.

Fig. 6.7
figure 7

Comparison between the classifiers based on the hinge loss and the pinball loss. As shown, the results of the pinball loss classifier are more stable

2.4.3 Learning from Label Proportions Model with Pinball Loss

With pinball loss, we can formulate the learning from label proportions model as below:

$$\displaystyle \begin{aligned} & \min_{y,w,b} \frac{1}{2} \|w\|{}^2 + C \sum_{i=1}^N L_{\tau}(1-y_{i}(w^{T}\phi (x_{i})+b))+C_{2} \sum_{k=1}^K |p_{k}(y)-P_{k}|, \\ & \mbox{s.t.}\ y_{i} \in \{-1,1\}. {} \end{aligned} $$
(6.111)

As the instance label vector y is also a variable, a natural way to solve Eq. (6.111) is via alternating optimization.

Step 1

For a fixed y, the optimization of Eq. (6.111) w.r.t. w and b becomes a classic SVM with pinball loss:

$$\displaystyle \begin{aligned} \min_{w,b} \frac{1}{2} \|w\|{}^2 + C \sum_{i=1}^N L_{\tau}(1-y_{i}(w^{T}\phi (x_{i})+b)).{} \end{aligned} $$
(6.112)

Step 2

When w and b are fixed, the problem becomes:

$$\displaystyle \begin{aligned} & \min_{y} \sum_{i=1}^N L_{\tau}(1-y_{i}(w^{T}\phi (x_{i})+b))+\frac{C_{2}}{C} \sum_{k=1}^K |p_{k}(y)-P_{k}|, \\ & \mbox{s.t.}\ y_{i} \in \{-1,1\}. {} \end{aligned} $$
(6.113)

By adopting the strategy presented in [80], the second step above can be solved efficiently. Since the influence of each bag on the objective is independent, we can optimize Eq. (6.113) on each bag separately. For a fixed p k(y) = θ, Eq. (6.113) can be optimally solved by the steps below.

  • Initialize y i, i ∈ B k.

  • Suppose flipping the sign of y i reduces the first term in (6.113) by δ i. Sort δ i, i ∈ B k.

  • Flip the signs of the top-R labels y i with the highest reduction δ i, where R = θ|B k|.

    By conducting Step 1 and Step 2 alternately until the decrease of the objective is smaller than a threshold (e.g. 10−4), we obtain a (locally) optimal solution.
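The per-bag steps above can be sketched as follows (an illustrative numpy sketch, not the authors' implementation; the name `solve_bag_labels`, the all-negative initialization, and the use of the current decision values f i = w Tϕ(x i) + b are our assumptions):

```python
import numpy as np

def solve_bag_labels(f, theta, tau):
    # For one bag with decision values f_i and candidate proportion theta,
    # flip to +1 the R = theta*|B_k| labels with the largest loss reduction.
    L = lambda u: np.where(u >= 0, u, -tau * u)   # pinball loss (6.110)
    y = -np.ones_like(f)                          # initialize all labels to -1
    delta = L(1 + f) - L(1 - f)                   # reduction from flipping y_i to +1
    R = int(round(theta * len(f)))
    y[np.argsort(delta)[::-1][:R]] = 1            # flip the top-R signs
    return y
```

Running this over a grid of candidate proportions θ per bag and keeping the best objective mirrors the strategy of [80].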

2.4.4 Dual Problem

The problem in Eq. (6.112) can be transformed into:

$$\displaystyle \begin{aligned} &\min_{w,b} \frac{1}{2} \|w\|{}^2 + C \sum_{i=1}^N \xi_{i},\\ & \mbox{s.t.}\ y_{i}(w^{T}\phi(x_{i})+b) \geq 1-\xi_{i}, i=1,2,\cdots,N,\\ &\ y_{i}(w^{T}\phi(x_{i})+b) \leq 1+\frac{1}{\tau}\xi_{i}, i=1,2,\cdots,N. {} \end{aligned} $$
(6.114)

According to the Karush-Kuhn-Tucker (KKT) sufficient and necessary optimality conditions, the dual problem of Eq. (6.114) is obtained as follows,

$$\displaystyle \begin{aligned} &\max_{\alpha,\beta} - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N (\alpha_{i} - \beta_{i})y_{i}\phi(x_{i})^{T}\phi(x_{j})y_{j}(\alpha_{j} - \beta_{j}) + \sum_{i=1}^N (\alpha_{i} - \beta_{i}),\\ & \mbox{s.t.}\ \sum_{i=1}^N (\alpha_{i} - \beta_{i})y_{i}=0,\\ &\ \alpha_{i} + \frac{1}{\tau}\beta_{i}=C,\ i=1,2,\cdots,N,\\ &\ \alpha_{i}\geq 0,\ i=1,2,\cdots,N,\\ &\ \beta_{i}\geq 0,\ i=1,2,\cdots,N. {} \end{aligned} $$
(6.115)

Introduce the variables γ i and let γ i = α i − β i. The dual problem Eq. (6.115) then has the same solution set w.r.t. γ as the following convex quadratic programming problem:

$$\displaystyle \begin{aligned} &\min_{\gamma,\beta} \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \gamma_{i} y_{i}\phi(x_{i})^{T}\phi(x_{j})y_{j}\gamma_{j} - \sum_{i=1}^N \gamma_{i},\\ & \mbox{s.t.}\ \sum_{i=1}^N \gamma_{i}y_{i}=0,\\ &\ -\tau C \leq \gamma_{i}\leq C,\ i=1,2,\cdots,N. {} \end{aligned} $$
(6.116)

Suppose \(\gamma ^{*}=(\gamma ^{*}_{1},\gamma ^{*}_{2},\ldots ,\gamma ^{*}_{N})\) is the solution to problem Eq. (6.116). We then have

$$\displaystyle \begin{aligned} & w^{*}= \sum_{i=1}^N \gamma^{*}_{i}y_{i} \phi (x_{i}), \mbox{and}\\ & b^{*}= y_{j}-\sum_{i=1}^N y_{i} \gamma^{*}_{i} \phi (x_{i})^{T} \phi (x_{j}), \end{aligned}$$

where j is any index with \(-\tau C < \gamma _{j}^{*} < C\).

Then the obtained decision function can be represented as

$$\displaystyle \begin{aligned} f(x) = \sum_{i=1}^N y_{i} \gamma^{*}_{i} \phi (x_{i})^{T} \phi (x) +b^{*}. \end{aligned}$$

2.4.5 Overall Optimization Procedure

Based on the detailed explanation above, the overall optimization procedure is summarized in Algorithm 6.18.

Algorithm 6.18 Optimization procedure of learning from label proportions

By alternating between solving (w, b) and y, the objective is guaranteed to converge, because the objective function is lower bounded and non-increasing. Empirically, the alternating optimization typically terminates within ten iterations.

In practice, the stopping criterion of the overall optimization procedure is that the objective function does not decrease any more (or if its decrease is smaller than a threshold).

2.4.6 Complexity

Step 1 has the complexity of an SVM with pinball loss. Since the bags are disjoint, their influences on the objective are independent. In Step 2, for each bag S k, sorting takes O(|S k|log(|S k|)), the same as in [80]. Overall, the complexity is \(O(\sum _{k=1}^K |S_{k}|log(|S_{k}|))\). Since \(\sum _{k=1}^K|S_{k}|=N\), and denoting J =maxk=1,2,…,K|S k|, the complexity is O(Nlog(J)).

3 Other Enlarged Learning Models

3.1 Classifying with Adaptive Hyper-Spheres: An Incremental Classifier Based on Competitive Learning

3.1.1 Basic Theory

3.1.1.1 A. Basic Theory of Supervised Competitive Learning

We partially borrow the topological structure of the CPN to introduce our model. CPNs are a combination of competitive networks and Grossberg's outstar networks [19]. The topological structure of a CPN has three layers: an input layer, a hidden layer, and an output layer (Fig. 6.8).

Fig. 6.8
figure 8

Topological structure of CPN.

Suppose there are N elements in the input layer, M neurons in the hidden layer, and L neurons in the output layer. Let vector V i = (v i1, …, v iN)T denote the weights of neuron i in the hidden layer connecting to each of the elements of the input layer. Then V = (V 1, …, V M) denotes the weight matrix of the instars. If the training in stage 1 is viewed as a clustering process, then neuron i is cluster c i and V i is the centroid of cluster c i.

When an instance arrives, the model computes the proximity between the instance and each V i in the weight matrix, i.e., the centroid of cluster c i. Here, proximity can be measured by the inner product \(net_{j}=V_{j}^{T} x, (j=1,2,\ldots ,M)\). A winner-takes-all strategy determines which neuron's weights are to be adjusted. The winner satisfies \(net_{j^{*}}=\max \{net_{j}\}\); in other words, the winner is \(c_{j^{*}}\), whose centroid is the closest to the incoming instance. The winning neuron's weights are adjusted as follows:

$$\displaystyle \begin{aligned} V_{j^{*}}(t+1) = V_{j^{*}}(t)+\alpha [x-V_{j^{*}}(t)], \end{aligned}$$

where α is the learning rate, indicating that the centroid of the winning cluster moves in the direction of x. As instances keep coming, the weight vectors, i.e., the centroids of the clusters, tend to move toward the densest regions of the space. This first stage of the CPN's training algorithm is a process of self-organizing clustering, although it is structured as a network.
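The winner-takes-all step can be sketched in a few lines (numpy; the helper name `update_winner` is ours):

```python
import numpy as np

def update_winner(V, x, alpha=0.1):
    # Winner: j* = argmax_j net_j with net_j = V_j^T x; then
    # V_j*(t+1) = V_j*(t) + alpha * (x - V_j*(t)).
    j = int(np.argmax(V @ x))
    V[j] += alpha * (x - V[j])
    return j
```

Repeating this over a stream of instances moves each row of V toward the densest region it wins.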

The second part of the structure is Grossberg learning [19]. We will redesign a different hidden layer and different connections from the hidden layer to the output layer.

3.1.1.2 B. Advantages and Disadvantages of the Original Model

To illustrate the advantages and disadvantages of the original model, a set of two-dimensional artificial data was created and visualized in Fig. 6.9.

Fig. 6.9
figure 9

Artificial datasets and the proposed clustering solutions

In Fig. 6.9a, instances can be grouped into six clusters. Setting the number of neurons in the hidden layer to six, the first training stage of the model in Fig. 6.8 can automatically find the centroids of the six clusters, which are represented by the weights of the six neurons. The second training stage can learn each cluster's connection to the correct class. The distance from each instance in Fig. 6.9a to its cluster centroid is smaller than its distances to the centroids of other clusters. The dataset shown in Fig. 6.9a is ideal for the CPN to classify.

The data distribution in Fig. 6.9a is simplified and idealistic. Data with a distribution similar to Fig. 6.9b cause two kinds of problems for the original model.

  1. (1)

    First, the self-organized clustering process depends on the similarity between data points and the hyper-spheres' centroids. Points closer to one cluster's centroid may actually belong to another cluster. Therefore, every cluster should have a definite scope or radius, and the scopes should be as far away from each other as possible.

  2. (2)

    Second, the number of clusters in the hidden layer is fixed in the original model. However, it is difficult to estimate the number of clusters in advance. Given different numbers of neurons in the hidden layer, the accuracy varies dramatically. The training of the instar layer, i.e., the clustering process, is contingent on this fixed number.

3.1.1.3 C. Building of the DMZ

To solve the first problem, we need a general knowledge of the scope of the clusters. For example, points of cluster A (in Fig. 6.9b) near the border may be closer to the centroid of cluster B, so these points would be considered to belong to cluster B in the original model. We must identify the decision border that separates clusters according to their labels. When two instances with conflicting labels fall into the same cluster, we have an opportunity to identify a border point somewhere between the two conflicting instances (as long as the instance is not an outlier). To maintain the maximum margin, and for the sake of simplicity, the median point of the two instances can be selected as a point in a zone called the Demilitarized Zone (DMZ), and clusters should be as far away from the DMZ as possible. As the number of conflicting instances increases, a general zone gradually forms as the DMZ. This mechanism can find borders of any shape that are surrounded by many hyper-spheres.

To solve the second problem, the number of clusters should not be predetermined. The clusters should be formed dynamically and merged or split if necessary. The scope of the hyper-spheres, represented by the corresponding radii, should be adjusted on demand. As an example, consider the situation presented in Fig. 6.9b: with instances of conflicting labels found in the top cluster, the original cluster should tune its radius. After training, a new cluster would be formed beneath the top cluster containing instances of different labels from the ones in the top cluster. The radii of the two clusters should be tuned according to their distance to the borders.

A single hyper-sphere may not enclose an area whose shape is not hyper-spherical [51]. However, any shape can be enclosed as long as the number of hyper-spheres is unlimited. Consider the clusters represented by the two-dimensional circles in Fig. 6.9c. All of the instances can be clustered no matter what the data distribution is and what the shape of the border is, as long as there are enough hyper-spheres of varying radii and they are properly arranged.

3.1.1.4 D. Proposed Topological Structure

Given the solutions above, the structure of our improved model is as follows (Fig. 6.10):

Fig. 6.10
figure 10

Topological structure of the proposed model

The first difference is that our model has an adaptive, dynamic hidden layer, i.e., the number of neurons in the hidden layer is adaptive. The second difference is that each neuron H i connects to only one particular neuron in the output layer, and w ij records the radius of neuron H i.

3.1.1.5 E. Kernelization

It is challenging to apply kernel methods to competitive learning models because they cannot be expressed in inner-product form. Some previous studies use approximation methods for the kernelization of competitive learning [29, 76]. Here, the Nyström method is used to kernelize the proposed model [28, 40].

Let the kernel matrix be written in block form:

$$\displaystyle \begin{aligned} A = \left[ \begin{array}{cc} A_{11} &A_{12}\\ A_{21} &A_{22} \end{array} \right], \end{aligned}$$

Let \(C=\left [A_{11}\ A_{12}\right ]^{T}\). The Nyström method uses A 11 and C to approximate the large matrix A. Supposing C is a uniform sampling of the columns, the Nyström method generates a rank-k approximation of A(k ≤ n) defined by:

$$\displaystyle \begin{aligned} A^{nys}_{k} = CA^{+}_{11}C^{T}=\left[ \begin{array}{cc} A_{11} &A_{21}^{T}\\ A_{21} & A_{21}A_{11}^{+}A^{T}_{21} \end{array} \right]\approx A, \end{aligned}$$

where \(A_{11}^{+}\) denotes the generalized pseudo inverse of A 11.

There exists an Eigen decomposition \(A_{11}^{+}=V \varLambda ^{-1} V^{T}\) such that each element \({A_{k}^{nys}}_{ij}\) in \(A^{nys}_{k}\) can be decomposed as:

$$\displaystyle \begin{aligned} {A_{k}^{nys}}_{ij} &= ( C_{i}^{T} V \varLambda ^{-1} V^{T} C_{j} )\\ &=(\varLambda ^{-1/2} V^{T} C_{i})^{T} (\varLambda ^{-1/2} V^{T} C_{j})\\ &=(\varLambda ^{-1/2} V^{T} (\kappa(x_{i},x_{1}),\ldots, \kappa(x_{i},x_{m})))^{T}\bullet (\varLambda ^{-1/2} V^{T} (\kappa(x_{j},x_{1}),\ldots, \kappa(x_{j},x_{m}))), \end{aligned}$$

where κ(x i, x j) is the base kernel function, x 1, x 2, …, x m are representative data points and can be obtained by uniform sampling or clustering methods such as K-means and SOFM.

Let ϕ m(x) = Λ −1∕2 V T(κ(x, x 1), …, κ(x, x m))T, such that \({A_{k}^{nys}}_{ij}=\phi _{m}(x_{i})^{T} \phi _{m}(x_{j})\approx \kappa (x_{i},x_{j})\).

With Nyström method, we can get an explicit approximation of the nonlinear projection ϕ m(x), which is:

$$\displaystyle \begin{aligned} x \rightarrow \phi_{m}(x).{} \end{aligned} $$
(6.117)
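A minimal numpy sketch of the projection in (6.117), assuming an RBF base kernel and a given set of landmark points (the helper names `rbf` and `nystrom_map` are ours):

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # Base kernel kappa(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_map(landmarks, gamma=1.0):
    # Build phi_m(x) = Lambda^{-1/2} V^T (kappa(x, x_1), ..., kappa(x, x_m))^T
    # from the eigendecomposition of A11 = kappa(landmarks, landmarks).
    A11 = rbf(landmarks, landmarks, gamma)
    lam, V = np.linalg.eigh(A11)
    keep = lam > 1e-10                 # pseudo-inverse: drop near-zero eigenvalues
    lam, V = lam[keep], V[:, keep]
    def phi(X):
        C = rbf(X, landmarks, gamma)   # rows are (kappa(x, x_1), ..., kappa(x, x_m))
        return (C @ V) / np.sqrt(lam)
    return phi
```

On the landmark block itself, the inner products of ϕ m reproduce A 11 exactly, since A 11A 11 +A 11 = A 11; elsewhere they approximate the kernel.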

To justify the use of kernel methods in our model, we first used the Nyström method to raise the dimension of dataset 3 to 403, and then used Singular Value Decomposition (SVD) to reduce the dimension to 2 for visualization. Figure 6.11 illustrates the transformed dataset 3 from Fig. 6.9c.

Fig. 6.11
figure 11

Artificial dataset 3 after Nyström and SVD transformation

Compared with Fig. 6.9c, the data in Fig. 6.11 can be covered with fewer hyper-spheres, i.e., each hyper-sphere can enclose more data points. Because the sampling points in the Nyström method can be obtained dynamically, the projection of Eq. (6.117) can be applied to every single instance in competitive learning and directly to our incremental model.

Without loss of generality, we use ϕ m(x) to denote a potential projection of x in the remainder of this paper. If the model works in the original space, the projection of x is itself.

3.1.2 Proposed Classifier: AdaHS

The main characteristic of the proposed model is to adaptively build hyper-spheres. Therefore, we call the model Adaptive Hyper-Spheres (AdaHS), and the version after Nyström projection is called Nys-AdaHS.

3.1.2.1 A. Training Stages

Our algorithms are trained in three stages, which are described below.

Stage 1. Forming Hyper-Spheres and Adjusting Centroids and Radii

  1. (1)

    Forming hyper-spheres and adjusting centroids

    Given that instances are read dynamically, there is no hyper-sphere at the beginning. The first input instance forms a hyper-sphere whose centroid is itself, with the initial radius set to a large value. When a new instance arrives and does not fall into any existing hyper-sphere, a new hyper-sphere is formed in the same way. If a new instance falls into one or more existing hyper-spheres, the winner is the one whose centroid is closest to the new instance. The winning cluster's centroid is recalculated as:

    $$\displaystyle \begin{aligned} c_{i}(t+1)=c_{i}(t)+\alpha[\phi (x)-c_{i}(t)], \end{aligned}$$

    where x is the new inputted instance, c(t) is the original centroid of the hyper-sphere, c(t + 1) is the new centroid, and α is the learning rate.

    When the number of instances that fall within a particular hyper-sphere grows, its centroid tends to move toward the densest zone.

    To speed up the search for the winner, we build simple k-dimensional trees for all hyper-spheres. With knowledge of the radius, it is easy to determine the upper and lower bounds of the selected k dimensions. In this way, we avoid extensive computation of the Euclidean distances of all instance and hyper-sphere pairs.

  2. (2)

    Building decision border zone: DMZ

    The goal of this step is to find the DMZ’s median points that approximate the shape of the DMZ.

    We find these points using the following technique. The first time a labeled instance falls into a hyper-sphere, the hyper-sphere is labeled with the label of this instance. If another instance with a conflicting label falls into the same hyper-sphere, this indicates that the hyper-sphere has entered the DMZ. We identify the nearest data point in the hyper-sphere to the newly input conflicting instance, and let p i represent the median point as follows:

    $$\displaystyle \begin{aligned} p_{i}=\frac{1}{2}(\phi (x_{conflicting})+c_{i}), \end{aligned}$$

    where ϕ(x conflicting), p i ∈ c i, and p i is recorded and used in the subsequent clustering process.

  3. (3)

    Adjusting the radii of hyper-spheres

    Once a DMZ point is found in a hyper-sphere, the radius of the hyper-sphere should be updated such that it does not enter the DMZ. The new radius of hyper-sphere c i should therefore be set as:

    $$\displaystyle \begin{aligned} r_{i}=d(p_{i},c_{i})-d_{safe}, \end{aligned}$$

    where d safe represents a safe distance that a hyper-sphere should keep from the closest DMZ point. The logic of this stage is outlined in Algorithm 6.19 below.

    Algorithm 6.19 The forming of hyper-spheres and the adjusting of the centroids and radii
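Steps (2) and (3) above can be sketched together as follows (an illustrative numpy sketch; the function name `handle_conflict` and the value of d safe are ours):

```python
import numpy as np

def handle_conflict(c_i, x_conflict, d_safe=0.1):
    # A conflicting label inside hyper-sphere c_i yields a DMZ median point,
    # after which the radius shrinks to stay d_safe away from the DMZ.
    p_i = 0.5 * (x_conflict + c_i)                # p_i = (phi(x_conflicting) + c_i) / 2
    r_i = np.linalg.norm(p_i - c_i) - d_safe      # r_i = d(p_i, c_i) - d_safe
    return p_i, r_i
```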

Stage 2. Merging Hyper-Spheres

Hyper-spheres may overlap with one another or even be contained in others. Therefore, after a certain period of training, a merging operation should be performed. Suppose we have two hyper-spheres, c A and c B, whose radii differ. Let c big =maxradius(c A, c B), c small =minradius(c A, c B), d t = d(c big, c small), and let θ be the merging coefficient. If d t + r small ≤ r big + θ × r small, the prerequisite to merge is met. Then let r temp = d t + r small, and the new radius of c big will be \(r_{new}=\max (r_{temp},r_{big})\).
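The merging test can be sketched as follows (numpy; the helper name `try_merge` and the default θ are ours):

```python
import numpy as np

def try_merge(c_a, r_a, c_b, r_b, theta=0.5):
    # Merge when d_t + r_small <= r_big + theta * r_small;
    # the merged radius is r_new = max(d_t + r_small, r_big).
    (c_big, r_big), (c_small, r_small) = sorted(
        [(np.asarray(c_a), r_a), (np.asarray(c_b), r_b)],
        key=lambda s: -s[1])
    d_t = np.linalg.norm(c_big - c_small)
    if d_t + r_small <= r_big + theta * r_small:
        return c_big, max(d_t + r_small, r_big)
    return None
```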

The details of this stage are outlined in Algorithm 6.20.

Algorithm 6.20 Merging of hyper-spheres

Stage 3. Selecting Hyper-Spheres

Since the training process is entirely autonomous, the number of generated hyper-spheres could be large. Therefore, the final stage selects hyper-spheres.

Three prominent types of hyper-spheres are described as follows:

  1. (1)

    The first type of hyper-spheres includes a large number of instances. Because these are the fundamental hyper-spheres that contain most data points, they are marked as "Core Hyper-spheres".

  2. (2)

    The second type of hyper-spheres has fewer instances but is located near the border. They are marked as "Support Hyper-spheres"; such hyper-spheres can be found by measuring the distance between hyper-spheres and the nearest DMZ points.

  3. (3)

    The third type of hyper-spheres has a small number of instances and is far away from the border. These hyper-spheres can be discarded.

    To achieve high classification accuracy, both core hyper-spheres and support hyper-spheres should be selected. The logic of the third stage is outlined in Algorithm 6.21.

    Algorithm 6.21 Selection of hyper-spheres

3.1.2.2 B. Mini-Batch Learning and Distributed Computing

To make the model applicable to large-scale applications, we encapsulate the proposed algorithms in a MapReduce framework. The incoming instances can be collected as mini-batch sets and then trained in MapReduce tasks. The computing model of the algorithms is illustrated in Fig. 6.12.

Fig. 6.12
figure 12

MapReduce computing model

The collected mini-batch instances can be encapsulated in key-value pairs and mapped into mapper tasks.

In each mapper task, the operations are instance-based. For every instance, the task queries the local cache to find which hyper-spheres the instance falls into, marks the winning hyper-sphere and the conflicting ones, and sends the hyper-spheres, along with descriptions of the needed operations, as key-value <id, hyper-sphere> pairs.

In each reducer task, the operations are based on hyper-spheres, aggregated according to the hyper-sphere ids emitted from the mapper tasks. Competitive learning can be conducted collectively on the aggregated instances. The tuning of a radius is performed only once, with the closest conflicting instance; the task then identifies the orphan points and returns the tuned hyper-sphere at the end.

After one turn of MapReduce tasks, the merging and selection of hyper-spheres are performed. After all operations, the tuned hyper-spheres are saved to the cache, and the orphan points are retrained in the next turn. Throughout the MapReduce process, sub-tasks do not coordinate with each other; thus the hyper-spheres and the DMZ are not updated in real time within a mini-batch turn but collectively after all reducer tasks return.

3.1.2.3 C. Predicting Labels

Just like other supervised competitive neural networks, AdaHS must determine the winning hyper-sphere in the hidden layer to predict the label of a new instance. There are two situations. In the first, the new instance falls into an existing hyper-sphere, and the label of the instance is determined by the label of that hyper-sphere. In the second, the new instance does not fall into any existing hyper-sphere, and its label is determined by a weighted vote of the k nearest hyper-spheres' labels:

$$\displaystyle \begin{aligned} y=\operatorname*{\arg\max}_{l_{j}} \sum_{c_{i} \in N_{k}(x)} w_{j} I(y_{i}=l_{j}), \end{aligned}$$

where \(w_{j}=\exp (-d_{E}(\phi (x),c_{j})^{2}/(2r_{j}^{2}))\); i = 1, 2, …, L; j = 1, 2, …, k; N k(x) denotes the k nearest hyper-spheres; and I is the indicator function. The default value of k is set to 3.
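Both prediction situations can be sketched as follows (an illustrative numpy sketch; the helper name `predict` is ours):

```python
import numpy as np

def predict(x, centroids, radii, labels, k=3):
    # Situation 1: x falls inside a hyper-sphere -> that sphere's label.
    # Situation 2: Gaussian-weighted vote among the k nearest hyper-spheres,
    # with w_j = exp(-d(x, c_j)^2 / (2 r_j^2)).
    d = np.linalg.norm(centroids - x, axis=1)
    inside = np.where(d <= radii)[0]
    if inside.size:
        return labels[inside[np.argmin(d[inside])]]
    votes = {}
    for j in np.argsort(d)[:k]:
        w = np.exp(-d[j] ** 2 / (2 * radii[j] ** 2))
        votes[labels[j]] = votes.get(labels[j], 0.0) + w
    return max(votes, key=votes.get)
```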

3.2 A Construction of Robust Representations for Small Data Sets Using Broad Learning System

3.2.1 Review of Broad Learning System

This subsection gives a brief introduction to the BLS; details can be found in [11]. The BLS is designed based on the random vector functional-link neural network (RVFLNN) [25, 47]. In the BLS, the mapped features and enhancement features, instead of the original features, are fed into a single-layer neural network. Figure 6.13 shows the structure of the BLS.

Fig. 6.13
figure 13

The structure of the BLS

In Fig. 6.13, X denotes the input features and Y the corresponding labels. The labels Y use one-hot encoding: all neurons are set to 0 except the one belonging to the label, which is set to 1. The mapped features can be represented as follows:

$$\displaystyle \begin{aligned} Z_{i}=\phi_{i} (XW_{ei}+\beta_{ei}),{} \end{aligned} $$
(6.118)

where Z i is the i-th group of mapped features and W ei are random weights. All the mapped features are concatenated as Z n ≡ [Z 1, Z 2, …, Z n]; the enhancement features can then be represented as follows:

$$\displaystyle \begin{aligned} H_{j}=\xi_{j} (Z^{n} W_{hj}+\beta_{hj}).{} \end{aligned} $$
(6.119)

All the enhancement features are concatenated as H m ≡ [H 1, H 2, …, H m]. Therefore, the broad model can be represented as follows:

$$\displaystyle \begin{aligned} Y=[Z^{n}|H^{m}]W^{m},{} \end{aligned} $$
(6.120)

where W m = [Z n|H m]+ Y  are the weights of the single-layer neural network, which can be easily calculated through the ridge regression approximation of [Z n|H m]+ using the following equation:

$$\displaystyle \begin{aligned} A^{+}=\lim_{\lambda \rightarrow 0} (\lambda I+ A^{T}A)^{-1} A^{T}.{} \end{aligned} $$
(6.121)

Theoretically, the ϕ i(⋅) and ξ j(⋅) used for the mapped features and enhancement features can be different functions. In [11], a sparse autoencoder is applied to fine-tune the W ei of the mapped features, and the sigmoid function is used to generate the enhancement features.
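Equations (6.118) to (6.121) can be sketched end to end as follows (a minimal numpy illustration, not the reference implementation; the widths, tanh activations, and the ridge parameter λ are our choices, and `bls_fit` is our name):

```python
import numpy as np

rng = np.random.default_rng(0)

def bls_fit(X, Y, n_maps=3, n_enh=2, width=8, lam=1e-6):
    # Mapped features Z_i = phi_i(X W_ei + beta_ei)            (6.118)
    Zs = [np.tanh(X @ rng.normal(size=(X.shape[1], width)) + rng.normal(size=width))
          for _ in range(n_maps)]
    Zn = np.hstack(Zs)                                         # Z^n = [Z_1, ..., Z_n]
    # Enhancement features H_j = xi_j(Z^n W_hj + beta_hj)      (6.119)
    Hs = [np.tanh(Zn @ rng.normal(size=(Zn.shape[1], width)) + rng.normal(size=width))
          for _ in range(n_enh)]
    A = np.hstack([Zn] + Hs)                                   # [Z^n | H^m]
    # Ridge approximation of A^+ Y: (lam*I + A^T A)^{-1} A^T Y (6.121)
    Wm = np.linalg.solve(lam * np.eye(A.shape[1]) + A.T @ A, A.T @ Y)
    return Wm, A @ Wm
```

Only the output weights W m are fitted; the mapped and enhancement weights stay random, which is what makes training a single least-squares solve.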

3.2.2 Proposed BLS Framework and BLS with RLA

3.2.2.1 A. BLS Framework

To extend the BLS into a framework that transforms inputs into robust representations, feature extraction methods instead of random mappings are used to generate the mapped features. Let Z i = ϕ i(X) denote the i-th group of mapped features, where ϕ i(⋅) can be any feature extraction method. Different feature extraction methods generate different mapped features; even if all mappings use the same AE method, the mapped features differ due to the randomness of neural networks. All the mapped features are concatenated as Z n ≡ [Z 1, Z 2, …, Z n], and this ensemble of mapped features provides a robust representation of the inputs.

The setting of a large number of enhancement nodes in the original BLS is removed. Deep representations, called enhancement features, are learned from the ensemble mapped features Z n. The enhancement features can be denoted as H j = ξ j(Z n), where ξ j(⋅) can be any feature extraction method. All the enhancement features are concatenated as H m ≡ [H 1, H 2, …, H m]. The concatenation of mapped features Z n and enhancement features H m can provide more robust representations to enhance the performance of downstream tasks.

Figure 6.14 shows the structure of the BLS framework. Note that the w and β used in (6.118) and (6.119) are random, i.e., those mappings are random mappings. The mappings ϕ(⋅) and ξ(⋅) in Fig. 6.14 can be any feature extraction method, including random mapping, autoencoders, convolutional feature extraction, recursive feature extraction, etc. Therefore, w and β are omitted in the new equations.

Fig. 6.14
figure 14

The structure of the BLS framework

Further, to generate more different mapped features and enhancement features in the BLS framework, samples and features can be randomly selected for each mapping. Figure 6.15 shows the structure of a random version of the BLS framework.

Fig. 6.15
figure 15

The structure of a random version of the BLS framework

3.2.2.2 B. BLS with RLA

The deep autoencoder (DA) is a nonlinear dimensionality reduction approach that usually works much better than PCA [20]. Instead of the unsupervised architecture used in the DA, the LA uses a supervised architecture that connects the features and the labels. The representation features learned by the LA contain not only the information of the original features but also the estimated label of the sample. In this case, the representation features can provide more information to machine learning models and may improve their performance. Figure 6.16 shows the structure of the LA.

Fig. 6.16
figure 16

The structure of the LA

In Fig. 6.16, X denotes the input features and Y the corresponding one-hot-encoded labels. The label layer is added on top of the representation layer, and a softmax function is used to predict the label. The loss of reconstructing X is measured by the mean squared error (MSE), and the loss of predicting the label Y is measured by the cross-entropy error (CEE). Therefore, the loss function of the LA can be represented as follows:

$$\displaystyle \begin{aligned} loss = \frac{1}{n} \sum_{i=1}^{n}(\alpha (x_{i}-\hat{x}_{i})^{2}-\beta y_{i} \log \hat{y}_{i}), \end{aligned}$$

where n is the number of samples, x i is the i-th sample, \(\hat {x}_{i}\) is the i-th reconstructed sample, y i is the i-th sample's one-hot label, \(\hat {y}_{i}\) is the i-th sample's softmax output, and α and β are scale factors.
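The LA loss above can be written directly (a numpy sketch; we read the squared term as a sum over feature dimensions, and \(\hat{y}\) is assumed strictly positive so the logarithm is defined):

```python
import numpy as np

def la_loss(x, x_hat, y, y_hat, alpha=1.0, beta=1.0):
    # loss = (1/n) sum_i ( alpha * ||x_i - x_hat_i||^2 - beta * y_i . log y_hat_i )
    mse = np.sum((x - x_hat) ** 2, axis=1)       # reconstruction (MSE) term
    cee = -np.sum(y * np.log(y_hat), axis=1)     # label (cross-entropy) term
    return float(np.mean(alpha * mse + beta * cee))
```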

To illustrate how the BLS framework works, the LA is embedded into the BLS framework as an example. More specifically, the mappings ϕ i(⋅) and ξ j(⋅) in the BLS framework are the same feature extraction method, the LA. The mapped features Z i are learned by using the original data X as both the input and output of the LA. After learning n groups of mapped features, all the mapped features are concatenated as Z n. The enhancement features H j are learned by using Z n as both the input and output of the LA; then all the mapped features and enhancement features are concatenated as the final input and fed into any machine learning model.

Because of the random initialization of the LA's weights, each mapped feature and enhancement feature will be different. Further, randomly picked samples and randomly picked features can be used as inputs for each LA; the resulting random label-based autoencoder (RLA) can generate more diverse mapped features and enhancement features in the BLS. The randomness of the RLA is controlled by two parameters: the selected sample size and the selected feature size. If the selected sample size and the selected feature size are less than 1, samples and features are randomly picked according to these two sizes. If both are equal to 1, all samples and features are used to train the RLA. Therefore, the LA is a special case of the RLA.

Given a two-layer encoder structure, the input fed into a machine learning model is as follows:

$$\displaystyle \begin{aligned} input &= [Z^{n} | H^{m}]\\ &=[Z_{1},\ldots,Z_{n} | H_{1},\ldots,H_{m}]\\ &=\left[ \begin{array}{c} \sigma (w_{11}\sigma(w_{12}x+b_{12})+b_{11}),\ldots, \sigma (w_{n1}\sigma(w_{n2}x+b_{n2})+b_{n1}) |\\ \sigma (w^{\prime}_{11}\sigma(w^{\prime}_{12}Z^{n}+b^{\prime}_{12})+b^{\prime}_{11}),\ldots, \sigma (w^{\prime}_{m1}\sigma(w^{\prime}_{m2}Z^{n}+b^{\prime}_{m2})+b^{\prime}_{m1}) \end{array} \right], \end{aligned}$$

where w is the weight, b is the bias, and σ(⋅) is the activation function.

The number of mapped features n and the number of enhancement features m are independent and depend on the complexity of the modeling problem. Additional mapped features and enhancement features can be added to achieve better performance when the setting (n, m) does not reach the desired accuracy. The pseudocode of the BLS with RLA is shown in Algorithms 6.22 and 6.23.

Algorithm 6.22 Broad learning: increment of additional enhancement features

Algorithm 6.23 Broad learning: increment of additional mapped features

In addition, it should be noted that the BLS does not conflict with feature selection methods. The BLS can be used before or after feature selection, and the selected features can also be concatenated with the mapped features and enhancement features.