1 Introduction

In natural language processing and information retrieval, the bag-of-words (BoW) model is widely used, e.g., to represent the semantics of a text or to record users’ query logs. Many mature technologies exist to generate a BoW, for example: (1) the keywords extracted from a document, where the keyword extractors include TextRank [32], RAKE [39] and TAKE [36], as well as other extractors [3, 4, 18, 52]; (2) a word distribution obtained for a specific topic using topic model-based methods: LDA [8], hierarchical topic models [24] and structural topic models [38]; (3) a user’s query log recorded by a search engine [12, 25, 33].

However, a bag of words (BoW) is just a collection of scattered words and is difficult for machines to understand without an explicit semantic explanation [15, 27, 35]. Therefore, an explicit semantic summary is very helpful for understanding a BoW. Many solutions have been proposed for this task; the typical methods include (a) extraction-based methods, i.e., extracting representative words from a BoW to summarize it [19, 50], and (b) conceptualization-based methods, i.e., conceptualizing a BoW with several concepts [11, 21, 23, 26, 41, 44, 47, 48, 49, 51].

In particular, conceptualization-based methods have been widely studied. They generate a small set of concepts as labels to explicitly explain the semantics of a BoW, a task known as conceptual labeling (CL). Compared with extraction-based methods, CL makes full use of external knowledge (e.g., knowledge bases) to summarize the semantics of a BoW without information loss. In a typical CL solution [22, 41, 43], a BoW is first automatically divided into multiple groups according to semantic relevance, and then each group is labeled with a concept that specifically summarizes its semantics. We present two examples:

  • white, black, color, dark blue → color

  • apple, pear, banana, radish, potato → fruit, vegetable

For human beings, the labels on the right are the concepts that come to mind when given the words and phrases on the left. Therefore, the conceptual labels can help machines better understand a BoW and thus handle downstream tasks more accurately, e.g., reading comprehension [40], text summarization [17], and understanding user intent in information retrieval [1].

Traditional CL methods generate only flat conceptual labels (i.e., a label set with a single granularity) for a BoW, which are, however, still insufficient in many applications. In this paper, we propose for the first time the task of hierarchical conceptual labeling (HCL), which explains the semantics of a BoW using hierarchical conceptual labels (i.e., a label set with different granularities). In HCL, the input is a BoW and the output is a hierarchical conceptual label set. For example, given the BoW {China, Japan, France, Germany, Russia}, the possible hierarchical conceptual labels are shown in Figure 1. The hierarchy contains two levels: {country} and {Asian country, EU State}. The former has a coarser granularity, while the latter has a finer one. In general, the label hierarchy generated for a large BoW will contain more levels.

Figure 1

The generated hierarchical conceptual labels for a BoW

We point out that the task of CL is a special case of HCL, where the first level of the hierarchy corresponds to the result of CL. For example, the flat conceptual labels {Asian country, EU State} are a suitable CL result for the BoW in Figure 1. Compared with flat conceptual labels, hierarchical labels contain more information, ranging from the most specific labels to more abstract ones. This flexibility allows applications to select labels with different degrees of abstractness according to their requirements.

Besides, in many scenarios, a BoW inevitably contains some noise words, which are, however, ignored in many existing works [11, 23, 48, 49, 51]. To generate high-quality labels, it is necessary to detect and filter out the noise. To this end, we propose a simple but effective method to remove the noise before the conceptualization operation.

1.1 Motivation

The motivation of HCL comes from two perspectives: psychology and applications.

1.1.1 Motivation from psychology perspective

Humans understand the world by classifying objects into concepts, and this process is often automatic and subconscious [49]. Moreover, the concepts of an object form a set of hierarchically organized categories, ranging from extremely general to extremely specific [34], which motivates the task of HCL. However, as far as we know, no existing work addresses this problem, which offers a novel way to explicitly explain the semantics of a BoW.

1.1.2 Motivation from application perspective

We argue that the task of HCL is critical for many real scenarios. We envision a range of applications, including:

  • Explanation for hierarchical topic modeling. In topic modeling, a topic is represented by a distribution over words, i.e., a BoW. In general, it is difficult for machines to understand a topic from its word distribution without an explanation. Furthermore, topics in many applications (e.g., document topic modeling) are expected to be hierarchical to meet multi-granularity requirements [6]. Such hierarchical topics need hierarchical labels for a clear explanation.

  • Interactive Search. Interactive search systems allow users to enter keywords to search for desired items (e.g., documents or movies). HCL helps generate a hierarchical summary of the retrieved items [10, 53]. In this way, a system can better understand the user’s intention and then precisely recommend other interesting items. Users thus only need to focus on the items of interest, which narrows the search scope and improves the user experience.

Example

We present an example to describe how HCL helps to understand a user’s intention. In a shopping system, if a user queries the items “iPhone X” and “iPhone 8” sequentially, the system will understand that the user is most likely interested in iPhones and will try to recommend other iPhone products, e.g., iPhone XR and iPhone XS. Moreover, if this user further queries the item “HUAWEI Mate X”, then he or she may be interested in “High-end Phone” more generally. The system should then recommend other high-end phones, e.g., Samsung Galaxy Fold S10 and HUAWEI Mate 20. Fortunately, the results of HCL for the entered item set can strongly support user-intention understanding and product recommendation in this example. Specifically, HCL is expected to generate the multi-granularity conceptual labels “iPhone” and “High-end Phone” for the items {iPhone X, iPhone 8, HUAWEI Mate X}, as shown in Figure 2, where the items in the dotted box are recommended by the system using the conceptual labels.

Figure 2

Example of generating hierarchical conceptual labels for an item set

Moreover, HCL can support various recommendation systems (e.g., friend recommendation on Facebook and Twitter, product recommendation on Taobao and Amazon), thus enabling more accurate personalized recommendations.

1.2 Overview and contributions

1.2.1 Overview

The process of HCL can be summarized as follows. Given the raw data, e.g., texts, images and queries, we first construct the corresponding BoWs. Then the noise in BoWs is removed. Finally, the hierarchical conceptual labels are generated for the clean BoWs.

In general, the raw data is enhanced by the generated labels and the combination of them will be taken as inputs for the downstream applications. The framework of generation and use of hierarchical conceptual labels is presented in Figure 3.

Figure 3

The framework of generation and use of hierarchical conceptual labels

1.2.2 Contributions

The contributions of this paper can be summarized as follows.

  • We first propose the task of HCL. Besides, we present the conceptualization criteria to guide the label generation.

  • We propose a novel label generation framework based on a popular hierarchical clustering method, i.e., Bayesian rose trees (BRT) [9].

  • We also present an effective denoising algorithm to filter out the noise in a BoW.

1.2.3 Paper organization

The rest of the paper is organized as follows. The BRT algorithm is briefly introduced in Section 2. Section 3 discusses the specific details of the denoising algorithm and HCL. We present the experimental results in Section 4. The related work is surveyed in Section 5. Section 6 gives a brief conclusion and future work.

2 Bayesian rose trees

In this section, we briefly introduce the Bayesian rose trees (BRT) algorithm [9]. BRT is an agglomerative hierarchical clustering algorithm that builds a tree bottom-up by combining the two most similar clusters at each step.

Let \(\mathcal {D} = \left \{{{d_{1}},{d_{2}}, {\cdots } ,{d_{N}}} \right \}\) be the entire dataset with N data points (i.e., \(\left | \mathcal {D} \right | = N\)) to be clustered. The BRT algorithm is initialized with N trivial trees {Tk, k = 1,⋯ ,N}, each of which contains a single data point \(\mathcal {D}_{k} = \{d_{k}\}\). In each step, BRT chooses a pair of subtrees Ti and Tj and merges them into Tm. Unlike binary hierarchical clustering, BRT considers the following three possible merging operations.

  • Join. Tm = {Ti,Tj}, that is, Tm has two child nodes, and the corresponding data points \(\mathcal {D}_{m} = \mathcal {D}_{i}\cup \mathcal {D}_{j}\). Assume that, after multi-step clustering, the structures of trees Ti and Tj are as shown in Figure 4a. Then the result of the join operation on {Ti,Tj} is presented in Figure 4b.

  • Absorb. The absorb operation has two symmetrical cases. The first case is Tm = {children(Ti),Tj}, where children(Ti) denotes the child nodes of Ti. For example, in Figure 4b, children(Ti) = {Ta,Tb,Tc}. That is, Tm has |children(Ti)| + 1 child nodes, and \(\mathcal {D}_{m} = \mathcal {D}_{i}\cup \mathcal {D}_{j}\). The second case is Tm = {Ti,children(Tj)}, that is, Tm has |children(Tj)| + 1 child nodes and \(\mathcal {D}_{m} = \mathcal {D}_{i}\cup \mathcal {D}_{j}\). The merged result of the first case is shown in Figure 4c.

  • Collapse. Tm = {children(Ti),children(Tj)}, that is, Tm has |children(Ti)| + |children(Tj)| child nodes and \(\mathcal {D}_{m} = \mathcal {D}_{i}\cup \mathcal {D}_{j}\). The merged result is shown in Figure 4d.

Figure 4

Three possible merging operations in BRT

To determine which two subtrees should be merged as well as which merging operation should be selected, BRT considers the criterion of maximizing the likelihood ratio \(L\left ({T_{m} } \right )\), which is defined as follows:

$$ L\left( {T_{m} } \right) = \frac{{p\left( {\mathcal{D}_{m} \left| {T_{m} } \right.} \right)}}{{p\left( { {\mathcal{D}_{i} } \left| {T_{i} } \right.} \right)p\left( {\mathcal{D}_{j}\left| {T_{j} } \right.} \right)}} $$
(1)

where \(p(\mathcal {D}_{m}|T_{m})\) is the likelihood of the data \(\mathcal {D}_{m}\) given the tree Tm, which is calculated recursively as follows:

$$ p\left( {{\mathcal{D}_{m}}\left| {{T_{m}}} \right.} \right) = {\pi_{m}}f(\mathcal{D}_{m}) + \left( {1 - {\pi_{m}}} \right)\prod \limits_{{T_{k}} \in \text{children}(T_{m})} p\left( {{\mathcal{D}_{k}}\left| {{T_{k}}} \right.} \right) $$
(2)

where \(f(\mathcal {D}_{m})\) is the marginal probability of the data \(\mathcal {D}_{m}\). That is, \(f(\mathcal {D}_{m})\) denotes the probability that all the data points in \({\mathcal {D}_{m}}\) are generated by the same probabilistic model. πm is the “mixing proportion”, i.e., the prior probability that all the data in Tm is kept in one cluster instead of being partitioned into subtrees. ch(Tm) is used as an abbreviation of children(Tm).
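As an illustration, the recursion in (2) and the ratio in (1) can be sketched as follows. This is a minimal sketch, not the authors’ implementation; the `Tree` structure and the `marginal` callback (standing in for f(·)) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Tree:
    data: frozenset                 # the data points D_m covered by this node
    children: List["Tree"] = field(default_factory=list)
    pi: float = 0.5                 # mixing proportion pi_m

def likelihood(tree: Tree, marginal: Callable[[frozenset], float]) -> float:
    """p(D_m | T_m) from equation (2); a leaf's likelihood is simply f(D_m)."""
    f = marginal(tree.data)
    if not tree.children:           # leaf: a single data point
        return f
    prod = 1.0
    for child in tree.children:     # product over children(T_m)
        prod *= likelihood(child, marginal)
    return tree.pi * f + (1.0 - tree.pi) * prod

def likelihood_ratio(tm: Tree, ti: Tree, tj: Tree, marginal) -> float:
    """L(T_m) from equation (1), used to score a candidate merge of Ti and Tj."""
    return likelihood(tm, marginal) / (likelihood(ti, marginal) * likelihood(tj, marginal))
```

For example, joining two singleton leaves under π = 0.5 with f = 0.5 for the pair yields a likelihood of 0.5·0.5 + 0.5·1·1 = 0.75.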

3 Hierarchical conceptual labeling

In this section, we first introduce the knowledge base used in our framework, i.e., Microsoft Concept Graph (MCG). Then we present how to filter out the noise words in a BoW. Finally, we describe the HCL model based on BRT algorithm.

3.1 Using MCG knowledge

In our framework, we use knowledge bases or semantic networks to provide candidate concepts. In recent years, much effort has been devoted to building such knowledge bases or semantic networks. Some of them, such as WordNet [20] and Freebase [2], are created by human experts or community efforts. Others, such as MCG [16], are created by data-driven approaches.

In this paper, we choose MCG as the candidate concept source for the following two reasons.

  • The knowledge in MCG is very suitable for HCL. MCG provides a large number of concept-entity pairs in the form <concept, entity, frequency>, where the entity is an instance of the concept and the frequency denotes the co-occurrence count of the concept-entity pair, such as <country, China, 10723> and <flower, rose, 493>.

  • MCG was created by data-driven approaches with a very large scale. That is, MCG provides 20,757,545 concept-entity pairs, which are extracted from 1.68 billion Web pages in Bing’s Web repository.

To measure the semantic relevance between an entity and a concept, we introduce the typicality score, which plays an important role in selecting proper conceptual labels for our task. Typicality is defined as [42]:

$$ {p\left( {e\left| c \right.} \right) = \frac{{n\left( {c,e} \right)}}{{\sum\nolimits_{{e_{i}}} {n\left( {{c},e_{i}} \right)} }}} \quad {p\left( {c\left| e \right.} \right) = \frac{{n\left( {c,e} \right)}}{{\sum\nolimits_{{c_{i}}} {n\left( {{c_{i}},e} \right)} }}} $$
(3)

where e is an entity, c is a concept, and \({n\left ({c,e} \right )}\) is the frequency with which c and e occur together in a syntactic pattern for an isA relationship. Intuitively, typicality measures how likely we are to think of an entity (or a concept) when given a concept (or an entity). For example, given the entity rose, people are more likely to think of the concept flower than plant, i.e., \(p\left ({\textit {flower} \left | \texttt {rose} \right .} \right ) > p\left ({\textit {plant} \left | \texttt {rose} \right .} \right )\), indicating that flower represents the semantics of rose more specifically than plant.

In addition to typicality, we also need to define the prior probabilities of a concept and an entity. In the existing work [41, 44], the following formulas are used to approximate them:

$$ {p(c) = \frac{{\sum\nolimits_{{e_{i}}} {n\left( {{c},e_{i}} \right)}}}{{\sum\nolimits_{{(c_{j},e_{i})}} {n\left( {{c_{j}},e_{i}} \right)} }}} \quad {p(e) = \frac{{\sum\nolimits_{{c_{j}}} {n\left( {{c_{j}},e} \right)}}}{{\sum\nolimits_{{(c_{j},e_{i})}} {n\left( {{c_{j}},e_{i}} \right)} }}} $$
(4)

where the denominator denotes the sum of the frequencies of all concept-entity pairs in MCG.
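As an illustration, both typicality (3) and the priors (4) reduce to simple frequency ratios over the <concept, entity, frequency> triples. A minimal sketch follows; the toy triple list is hypothetical except for the two MCG counts quoted above (<country, China, 10723> and <flower, rose, 493>).

```python
from collections import defaultdict

# Hypothetical toy triples in MCG's <concept, entity, frequency> form;
# only the first two counts come from the examples in the text.
triples = [("country", "China", 10723), ("flower", "rose", 493),
           ("plant", "rose", 120), ("country", "Japan", 9800)]

n = defaultdict(int)          # n(c, e): frequency of the concept-entity pair
by_concept = defaultdict(int) # sum_e n(c, e)
by_entity = defaultdict(int)  # sum_c n(c, e)
total = 0                     # sum over all pairs
for c, e, freq in triples:
    n[(c, e)] += freq
    by_concept[c] += freq
    by_entity[e] += freq
    total += freq

def p_e_given_c(e, c): return n[(c, e)] / by_concept[c]  # equation (3), left
def p_c_given_e(c, e): return n[(c, e)] / by_entity[e]   # equation (3), right
def p_c(c): return by_concept[c] / total                 # equation (4), left
def p_e(e): return by_entity[e] / total                  # equation (4), right
```

With these toy counts, p(flower | rose) > p(plant | rose), matching the intuition described above.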

3.2 Filtering out the noise

BoWs generated in many scenarios usually contain noise. For example, in topic modeling, each document is modeled as a probability distribution over topics, and each topic is represented by a probability distribution over words (i.e., a BoW). However, some words may not be semantically related to the corresponding topic, which hinders the understanding of the topics as well as the documents. These words are, in fact, noise and should be filtered out. In this work, we propose a simple but very effective approach to filter out the noise in advance. The basic idea is: if a word in a BoW is hard to cluster semantically with any other word, i.e., difficult to tag with the same conceptual label as any other word, then we treat it as noise and remove it from the BoW.

Specifically, let \(\mathcal {D}\) be the input BoW, and di (dj) be the i-th (j-th) entity in \(\mathcal {D}\). We take p(c|di,dj) to measure how well the concept c conceptualizes the semantics of the two entities di and dj. Based on the assumption that all entities in \(\mathcal {D}\) are independent of each other [29, 44], we use Bayes’ theorem to compute p(c|di,dj) as follows:

$$ p(c|d_{i}, d_{j}) = \frac{p(d_{i},d_{j}|c)p(c)}{p(d_{i},d_{j})} = \frac{p(d_{i}|c)p(d_{j}|c)p(c)}{p(d_{i})p(d_{j})} $$
(5)

Furthermore, we assume that all the entities in \(\mathcal {D}\) have equal prior probability, i.e., \(p(d_{k}) = \tilde {p}\) (\( \forall d_{k} \in \mathcal {D}\)). Then

$$ p(c|d_{i}, d_{j}) = \frac{1}{\tilde{p}^{2}}p(d_{i}|c)p(d_{j}|c)p(c) $$
(6)

The prior probability p(c) measures the popularity of c; that is, popular concepts have a large prior. Intuitively, a larger p(c|di,dj) indicates that di and dj can be well summarized by c, and hence that di and dj have strong semantic relevance. p(dk|c) and p(c) are estimated using the knowledge in MCG (see (3) and (4)).

Let \(\mathcal {C}_{i}\) and \(\mathcal {C}_{j}\) be the concept sets of di and dj in MCG, respectively. \(\mathcal {C}_{i,j} = \mathcal {C}_{i}\cap \mathcal {C}_{j}\) denotes the shared concept set of di and dj. We describe the denoising algorithm as follows. Given a word \({d_{i}\in \mathcal {D}}\), for any other word \(d_{j}\in \mathcal {D}\) (dj ≠ di), if we cannot find an appropriate concept in \(\mathcal {C}_{i,j}\) to conceptualize di and dj, then di is treated as noise. That is,

$$ \max\limits_{{{d_{j} \in \mathcal{D}}, c \in \mathcal{C}_{i,j}}} p\left( {c\left| {{d_{i}},{d_{j}}} \right.} \right) < {\delta} $$
(7)

where δ is a pre-specified threshold. Considering that \(1/ \tilde {p}^{2}\) is the same for all the words in \(\mathcal {D}\), we use the following simplified form to filter out the noise in \(\mathcal {D}\):

$$ \max\limits_{{{d_{j} \in \mathcal{D}}, c\in \mathcal{C}_{i,j}}} p(d_{i}|c)p(d_{j}|c)p(c) < {\delta} $$
(8)

Equation (8) avoids estimating \(\tilde {p}\). For each di in \(\mathcal {D}\), if inequality (8) holds, we delete di from \(\mathcal {D}\). The setting of δ is discussed in our experiments.
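The denoising test in (8) can be sketched as follows, reusing the frequency-ratio estimates of (3) and (4). This is an illustrative sketch with hypothetical toy triples, not the authors’ code.

```python
from collections import defaultdict

def build_stats(triples):
    """Aggregate <concept, entity, freq> triples into the counts behind (3)-(4)."""
    n, by_c, total = defaultdict(int), defaultdict(int), 0
    concepts_of = defaultdict(set)          # C_i: concept set of each entity
    for c, e, f in triples:
        n[(c, e)] += f; by_c[c] += f; total += f
        concepts_of[e].add(c)
    return n, by_c, total, concepts_of

def denoise(bow, triples, delta):
    """Drop every word for which no shared concept passes the test in (8)."""
    n, by_c, total, concepts_of = build_stats(triples)
    p_ec = lambda e, c: n[(c, e)] / by_c[c]  # p(e|c), equation (3)
    p_c = lambda c: by_c[c] / total          # p(c),  equation (4)
    keep = []
    for di in bow:
        best = 0.0
        for dj in bow:
            if dj == di:
                continue
            for c in concepts_of[di] & concepts_of[dj]:   # shared set C_{i,j}
                best = max(best, p_ec(di, c) * p_ec(dj, c) * p_c(c))
        if best >= delta:                    # otherwise (8) holds: di is noise
            keep.append(di)
    return keep
```

In a toy run where “apple” and “pear” share the concept “fruit” but “ring” shares no concept with either, “ring” is filtered out.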

3.3 Hierarchical conceptual labeling

After the denoising operation, we obtain a clean BoW. Next, we describe how to generate hierarchical conceptual labels for a clean BoW based on the BRT algorithm and the knowledge in MCG. The process can be summarized as follows: we cluster the BoW hierarchically using the BRT algorithm, and for each cluster \(\mathcal {D}_{m}\) we generate an appropriate conceptual label.

To this end, we address the following three key points. First, based on the description of BRT in Section 2, we need to estimate \(f(\mathcal {D}_{m})\) and πm for our task. Second, we need a solution for selecting the appropriate conceptual label after each clustering step. Finally, we also need a likelihood ratio threshold γ to limit the depth of the hierarchy, thus avoiding the generation of overly vague conceptual labels.

3.3.1 Calculation of \(f(\mathcal {D}_{m})\)

As stated before, \(f(\mathcal {D}_{m})\) denotes the probability that all the data points in \({\mathcal {D}_{m}}\) are generated by the same probabilistic model [20]. In many CL tasks, concepts are widely used as the underlying mechanisms responsible for the generation of entities [44]. In our task, \({\mathcal {D}_{m}}\) can likewise be considered to be generated by concepts in MCG. That is, any concept c in the shared concept set of \({\mathcal {D}_{m}}\) is a probabilistic model that could generate \({\mathcal {D}_{m}}\) with a certain probability, denoted as \(p(\mathcal {D}_{m}|c)\). For example, let \(\mathcal {D}_{m}\) be {China,Brazil,India}; the shared concept set is {developing country, emerging market, BRIC country}, and each of these concepts could generate \(\mathcal {D}_{m}\).

Specifically, let \(\mathcal {C}_{m}\) be the shared concept set of all the entities in \(\mathcal {D}_{m}\). Then \({\mathcal {D}_{m}}\) can be generated by any concept in \(\mathcal {C}_{m}\). But which concepts should be selected to generate \({\mathcal {D}_{m}}\)? We argue that each concept \(c_{i}\in \mathcal {C}_{m}\) is selected with a probability proportional to p(ci) [44]. Hence we define the selection probability of ci as follows:

$$ p_{s}(c_{i}) = \frac{p(c_{i})}{\sum \nolimits_{c \in \mathcal{C}_{m}}{p(c)}} $$
(9)

Then \(f(\mathcal {D}_{m})\) is computed as

$$ f(\mathcal{D}_{m}) = {\sum\limits_{{c} \in {\mathcal{C}_{m}}} {p_{s}\left( {{c}} \right)p\left( {{\mathcal{D}_{m}}\left| {{c}} \right.} \right)}} $$
(10)

Based on the independence assumption, \(p(\mathcal {D}_{m}|c)\) is calculated as

$$ p\left( {{\mathcal{D}_{m}}\left| {{c}} \right.} \right) = \prod \limits_{{d_{i}} \in {\mathcal{D}_{m}}} p\left( {{d_{i}}\left| {{c}} \right.} \right) $$
(11)

The rationality of equation (10) can be interpreted as follows.

  1. A larger \(|\mathcal {C}_{m}|\) indicates that the words in \(\mathcal {D}_{m}\) are more similar in semantics, hence yielding a larger \(f(\mathcal {D}_{m})\).

  2. \(\mathcal {C}_{m}=\varnothing \) indicates that there is no shared concept for \(\mathcal {D}_{m}\). As a result, \(f(\mathcal {D}_{m})=0\), which implies that the words in \(\mathcal {D}_{m}\) cannot be generated by a single model and should be partitioned into multiple clusters.
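Taken together, equations (9)–(11) can be sketched as follows. The helper callbacks `p_c` and `p_e_given_c` stand in for the MCG-based estimates in (4) and (3); the names are hypothetical.

```python
def f_dm(dm, concepts_of, p_c, p_e_given_c):
    """f(D_m) via equations (9)-(11): a mixture over the shared concept set C_m.

    dm           : iterable of entities in the cluster D_m
    concepts_of  : dict entity -> set of its concepts in the knowledge base
    p_c          : callable c -> prior p(c), equation (4)
    p_e_given_c  : callable (e, c) -> typicality p(e|c), equation (3)
    """
    dm = set(dm)
    shared = set.intersection(*(concepts_of[d] for d in dm)) if dm else set()
    if not shared:
        return 0.0                      # C_m empty => f(D_m) = 0, point 2 above
    z = sum(p_c(c) for c in shared)     # normalizer of equation (9)
    total = 0.0
    for c in shared:
        gen = 1.0
        for d in dm:                    # equation (11): independence of entities
            gen *= p_e_given_c(d, c)
        total += (p_c(c) / z) * gen     # equation (10): p_s(c) * p(D_m | c)
    return total
```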

3.3.2 Estimation of πm

πm is the prior probability that all the data in Tm are kept in one cluster instead of being partitioned into multiple clusters. In BRT, πm is a hyperparameter selected according to the application: a larger πm leads to coarser partitions, and a smaller πm leads to finer partitions. In this paper, we set πm = 0.5, which reflects that we have no prior knowledge about which partitions are preferable.

3.3.3 Label selection

The original BRT algorithm only generates hierarchical clusters for a BoW. In our problem, we further select an appropriate conceptual label to conceptualize each cluster \(\mathcal {D}_{m}\) well. Specifically, let \(\mathcal {C}_{m}\) be the shared concept set of \(\mathcal {D}_{m}\). The following criterion is used to select the most appropriate conceptual label:

$$ c^{\ast}_{m} = \arg \underset{c \in {\mathcal{C}_{m}}}{{\max}} p\left( {c\left| {\mathcal{D}_{m}} \right.} \right) = \arg\underset{c \in {\mathcal{C}_{m}}}{ \max } p(\mathcal{D}_{m}|c)p(c) $$
(12)
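The selection rule (12) is a straightforward argmax over the shared concept set; a minimal sketch (the callback names are hypothetical stand-ins for the estimates in (3) and (4)):

```python
def select_label(dm, shared_concepts, p_c, p_e_given_c):
    """Pick c* = argmax_c p(D_m|c) p(c), equation (12); None if C_m is empty."""
    best, best_score = None, 0.0
    for c in shared_concepts:
        score = p_c(c)
        for d in dm:
            score *= p_e_given_c(d, c)   # p(D_m|c) under the independence assumption
        if score > best_score:
            best, best_score = c, score
    return best
```

For example, for {rose, lily} the high typicality of both entities under flower can outweigh a more popular but less typical concept such as plant.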

3.3.4 Likelihood ratio threshold γ

The original BRT algorithm eventually generates a single tree; that is, all the data will eventually be put into one cluster. This is not suitable for our task. Instead, we should stop clustering when there is no appropriate label that conceptualizes the current cluster well. For example, given \(\mathcal {D}_{m} =\) {children, cat, cow, elephant}, labels such as creature are semantically too vague to conceptualize \(\mathcal {D}_{m}\), so there is no appropriate label for it. In our work, we introduce a likelihood ratio threshold γ and stop clustering when \(L\left ({{T_{m}}} \right ) < \gamma \).

Algorithm 1

The process of generating hierarchical conceptual labels for a BoW is shown in Algorithm 1, where \(\mathcal {D}\) is a clean BoW whose noise has been filtered out. In the initialization, we set \(p(\mathcal {D}_{i}|T_{i})=1\) (i = 1,⋯,N) and \(L\left ({{T_{m}}} \right ) = \gamma _{0}\) (γ0 can be any value, as long as γ0 > γ). In each clustering step, the pair of trees Ti and Tj as well as the merge operation is determined by (13). We select the concept in \(\mathcal {C}_{m}\) that maximizes \(p(c|\mathcal {D}_{m})\) as the label for cluster \(\mathcal {D}_{m}\) (see (14)). The procedure is repeated until \(L\left ({{T_{m}}} \right ) < \gamma \), which avoids the generation of overly vague conceptual labels. We point out that no particular relationship is guaranteed among the generated labels in the hierarchy.
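For illustration, the overall loop can be sketched in a simplified, join-only form (absorb and collapse are omitted, and `f` stands in for the marginal f(·) of Section 3.3.1). This is our sketch under those assumptions, not Algorithm 1 itself.

```python
def hcl_join_only(bow, f, gamma, pi=0.5):
    """Greedy HCL sketch (join operation only): repeatedly merge the pair of
    trees with the highest likelihood ratio L(T_m), equation (1), and stop
    once the best ratio falls below gamma. Each tree is a tuple
    (data_frozenset, likelihood, children)."""
    trees = [(frozenset({d}), f(frozenset({d})), []) for d in bow]
    merges = []                                      # record of (merged_set, L(T_m))
    while len(trees) > 1:
        best = None                                  # (ratio, i, j, merged tree)
        for i in range(len(trees)):
            for j in range(i + 1, len(trees)):
                di, li, _ = trees[i]
                dj, lj, _ = trees[j]
                dm = di | dj
                lm = pi * f(dm) + (1 - pi) * li * lj  # equation (2), join case
                ratio = lm / (li * lj)                # equation (1)
                if best is None or ratio > best[0]:
                    best = (ratio, i, j, (dm, lm, [trees[i], trees[j]]))
        ratio, i, j, tm = best
        if ratio < gamma:
            break                    # stop: any further label would be too vague
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [tm]
        merges.append((tm[0], ratio))
    return trees, merges
```

After the loop, each recorded cluster would be labeled by the argmax rule of (12).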

3.3.5 Example

We present an example of the process of generating hierarchical conceptual labels in Figure 5, where each step represents a clustering operation. Given the BoW {Tiger, Lion, Lily, Rose, Ring, Leaf, Tree, Poplar, Pine}, the noise (i.e., Ring) is filtered out in advance. In the first step of HCL, our method chooses Tree and Poplar and merges them into a cluster with the generated label Tree. In the second step, it merges the cluster from the first step with the new entity Pine into a new cluster, again labeled Tree. In the third step, it merges the two entities Tiger and Lion into a cluster with the generated label Mammal. The method then generates the label Flower (Plant) in the fourth (fifth) step. Clustering stops when there is no proper label to further conceptualize the whole BoW. Note that the entity Leaf cannot be conceptualized by the generated labels.

Figure 5

Example of generating hierarchical conceptual labels

3.3.6 Labels to be evaluated

After the hierarchical clustering is completed, we obtain the generated hierarchical conceptual labels, where the first level of the hierarchy to be evaluated is defined as follows.

Definition 1

Let \(\mathcal {O}\) be the word set that cannot be conceptualized by our method. The labels in the i-th layer are defined as the first level if and only if the labels in the (i − 1)-th layer cannot conceptualize at least one word in the BoW apart from \(\mathcal {O}\).

In Figure 5, \(\mathcal {O}=\{\texttt {Leaf}\}\). The fine-grained label generation process is not completed until step 4, so only the labels in layers 4 and 5 conceptualize the BoW well with different granularities, i.e., {Mammal, Flower, Tree} and {Mammal, Plant}. Therefore, these hierarchical labels are evaluated in our experiments. Note that Leaf is not deleted in advance since it has a certain semantic relevance to Tree; however, it still cannot be well conceptualized by the generated conceptual labels.

4 Experiments

We conduct extensive experiments to evaluate the performance of our framework. In all experiments, we set δ = 5 × 10⁻⁸ and γ = 0.8. We first evaluate the denoising algorithm, and then analyze the generated conceptual labels on the synthetic and real datasets, respectively. Finally, the settings of the hyperparameters δ and γ are discussed.

4.1 Evaluation of denoising algorithm

4.1.1 Dataset construction

Considering that there is no dataset with ground truth for evaluating the denoising algorithm, we construct a synthetic one and conduct experiments on it. The synthetic dataset is automatically generated using MCG, where the BoW construction is the same as in [44]:

Step 1: We randomly select m concepts from MCG, and then for each concept we randomly select n instances (i.e., entities) from MCG. The selected mn entities constitute a clean BoW.

Step 2: We randomly select another l entities from MCG as noise. The selected entities then constitute a noisy BoW of size mn + l.

To simulate real applications, where BoWs have different sizes, we randomly choose six parameter settings to construct clean BoWs, as shown in Table 1. For each parameter setting, 1000 clean BoWs are constructed. Moreover, different noise rates are also considered for each BoW; in our dataset, the noise rates are set to 10%, 20%, 30% and 40%. That is, for each BoW, we add l noise entities, where

$$ l =\lceil {mn\frac{p}{1-p}} \rceil, \quad p = 10\%, 20\%, 30\%, 40\% $$
(15)

where ⌈x⌉ maps x to the least integer greater than or equal to x. For example, let m = 3, n = 4 and p = 30%; then l = 6. Therefore, for each parameter setting (e.g., m = 3, n = 4, l = 6), we obtain b = 1000 BoWs.
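Equation (15) can be computed directly. Note that, because of the ceiling, the realized noise rate l/(mn + l) can slightly exceed the target p.

```python
import math

def noise_count(m, n, p):
    """Number of noise words to add so the noise rate is (approximately) p,
    per equation (15): l = ceil(mn * p / (1 - p))."""
    return math.ceil(m * n * p / (1 - p))
```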

Table 1 The synthetic BoWs with different parameter settings

4.1.2 Metrics

Let \({\mathcal{B}}_{m,n,p,j}\) (j = 1,...,b) be the j-th BoW with parameter setting {m,n,p}, and \(\mathcal {N}_{m,n,p,j} \subset {\mathcal{B}}_{m,n,p,j}\) be its pre-added noise word set. We apply the denoising algorithm to \({\mathcal{B}}_{m,n,p,j}\) and obtain the filtered words \(\mathcal {N}^{\prime }_{m,n,p,j}\). Thus, \(\mathcal {N}_{m,n,p,j}\cap \mathcal {N}^{\prime }_{m,n,p,j}\) is the set of true noise words detected by the algorithm. For each parameter setting {m,n,p}, we define precision (P), recall (R) and F1-value (F1) as the metrics for evaluating the denoising algorithm. Specifically,

$$ {P}_{m,n,p} =\frac{1}{b} \sum\limits_{j=1}^{b}\frac{|\mathcal{N}_{m,n,p,j}\cap\mathcal{N}^{\prime}_{m,n,p,j}|}{|\mathcal{N}^{\prime}_{m,n,p,j}|} $$
(16)
$$ {R}_{m,n,p} =\frac{1}{b} \sum\limits_{j=1}^{b}\frac{|\mathcal{N}_{m,n,p,j}\cap\mathcal{N}^{\prime}_{m,n,p,j}|}{|\mathcal{N}_{m,n,p,j}|} $$
(17)
$$ {F1}_{m,n,p} = \frac{2 {P}_{m,n,p} {R}_{m,n,p}}{{P}_{m,n,p} + {R}_{m,n,p}} $$
(18)

where |⋅| denotes the number of words in a set.
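Equations (16)–(18) average precision and recall over the b BoWs before taking the harmonic mean; a minimal sketch (assuming every noise set and every detected set is non-empty):

```python
def denoise_metrics(true_noise_sets, detected_sets):
    """Averaged precision/recall over b BoWs, then F1, per equations (16)-(18).
    true_noise_sets[j] is N_{m,n,p,j}; detected_sets[j] is N'_{m,n,p,j}."""
    b = len(true_noise_sets)
    pairs = list(zip(true_noise_sets, detected_sets))
    precision = sum(len(t & d) / len(d) for t, d in pairs) / b  # equation (16)
    recall = sum(len(t & d) / len(t) for t, d in pairs) / b     # equation (17)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Note that F1 here is the harmonic mean of the already-averaged P and R, as in (18), rather than a per-BoW F1 averaged afterwards.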

4.1.3 Results and analysis

The F1 value for each parameter setting {m,n,p} is presented in Table 2. We draw the following conclusions. (1) In general, the proposed algorithm effectively detects the noise in BoWs with a small noise rate. For example, the F1 value reaches 0.93 under the parameter setting m = 2, n = 3 and p = 10%. (2) The F1 value decreases as the BoW size or the noise rate increases. A BoW with a large size or a high noise rate contains more noise words, so the semantic distances between noise words and non-noise words shrink, which increases the difficulty of noise detection. Despite this, our algorithm still retains strong detection ability for BoWs with a large size or a high noise rate. For example, the F1 value is 0.82 under the parameter setting {m = 5, n = 5, p = 40%}.

Table 2 The F1 values generated by our denoising algorithm under different parameter settings

4.2 Evaluation on synthetic dataset

To the best of our knowledge, there is no related work that generates hierarchical conceptual labels, but several baselines generating flat labels have been proposed. We therefore evaluate the local performance of HCL in this section; that is, the labels generated in the first level of the hierarchy are evaluated and compared with the counterparts generated by the baselines.

4.2.1 Baselines

The first-level conceptual labels generated by our algorithm are the most specific labels for conceptualizing a BoW. We compare them with the following two state-of-the-art methods.

  • Clustering-then-conceptualization (CC). CC is an extension of the state-of-the-art single-concept conceptualization approach [41]. To implement CC, we first cluster the input BoW with k-means according to the concept distributions of its words in MCG. Then we select the best single concept for each cluster using the naive Bayes model proposed by Song et al. [41].

  • MDL-based model [44]. This model aims at generating a minimum set of conceptual labels that strongly conceptualize a BoW. Two criteria measure the “goodness” of a candidate label set: (1) high semantic coverage of the BoW, and (2) a small size of the candidate label set. To trade off coverage against size, the minimum description length (MDL) principle is used to select the best conceptual label set.

4.2.2 Metrics

We still use the method described in Section 4.1.1 to generate synthetic BoWs. In our experiments, we set m = 5 and l = 5, and analyze the effect of n on the performance. For each n (n = 2, 4, 6, 8, 10), we automatically generate b = 1000 BoWs. For each BoW, we first filter out the noise with the proposed algorithm, and then use the three methods to generate conceptual labels (for our method, we select the labels in the first level). For the i-th (i = 1,...,b) BoW, we denote the number of generated conceptual labels by xi, of which yi are ground-truth labels (i.e., the m = 5 concepts selected in Step 1 in Section 4.1.1). We use precision (P), recall (R) and F1-value (F1) to measure the performance of each method, that is,

$$ P = \frac{{\sum\nolimits_{i = 1}^{b} {{y_{i}}} } }{ {\sum\nolimits_{i = 1}^{b} {{x_{i}}} } } \quad R = \frac{ {\sum\nolimits_{i = 1}^{b} {{y_{i}}} }} {{mb}} \quad F1 = \frac{{2PR}}{{P + R}} $$
(19)
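The metrics in (19) amount to a few lines of code. A minimal sketch, where `x` and `y` are the per-BoW counts defined above:

```python
def prf(x, y, m):
    """Precision, recall and F1 over b BoWs, following Eq. (19).

    x[i] is the number of labels generated for the i-th BoW,
    y[i] is how many of them are ground-truth labels, and each
    BoW has m ground-truth concepts.
    """
    P = sum(y) / sum(x)          # fraction of generated labels that are correct
    R = sum(y) / (m * len(x))    # fraction of ground-truth labels recovered
    F1 = 2 * P * R / (P + R)
    return P, R, F1
```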

4.2.3 Results

The results are shown in Figure 6. We conclude that: (1) the proposed method outperforms the two baselines in generating flat conceptual labels; (2) precision and recall increase with the number of entities for all methods, because more entities selected from one concept (a ground-truth label) give a model stronger evidence for selecting that concept as the label; (3) the experiment directly indicates that conceptualizing large BoWs (e.g., extracted from long texts) is easier than conceptualizing small BoWs (e.g., extracted from short texts), which is consistent with the conclusion that short texts are more difficult to understand than long texts [22, 47].

Figure 6

The precision, recall and F1-value of the three methods

4.3 Evaluation on real dataset

The evaluation above proves the effectiveness of our model in generating flat conceptual labels, i.e., the labels in the first level. We further conduct experiments to evaluate the global performance of our model in real applications, that is, evaluating the hierarchical conceptual labels on real data. These experiments face two challenges: first, there is no baseline that generates hierarchical conceptual labels; second, there are no established evaluation criteria for our task. To overcome the first challenge, we construct two new baselines; for the second, we propose criteria and adopt a human evaluation mechanism. Next, we describe the evaluation in detail.

4.3.1 Dataset construction

The real dataset used in this section contains two subsets: Flickr data and Wikipedia data.

  • Flickr data. Image tags in FlickrFootnote 3 are generally redundant and noisy. Hierarchical conceptual labels refine the tags and help machines understand the original tags as well as the images.

  • Wikipedia data comes from the results of topic modeling [5, 8] run on the entire Wikipedia corpus.Footnote 4 We extract the top words of each topic as a BoW. HCL for the topic words is critical for machines to understand and interpret the topics.

Each data subset contains b = 300 BoWs. The labels generated for these BoWs are the evaluation objects.

4.3.2 Baselines

We present two strong baselines constructed by ourselves.

  • Bayesian hierarchical clustering based model (BHC). BHC [20] is a popular hierarchical clustering method that generates hierarchies with a binary branching structure. We first use BHC to hierarchically cluster the BoW. Then each node in the hierarchical structure is assigned a concept as the label to conceptualize the corresponding subset of the BoW, where the candidate concepts are also from MCG. In this way, we generate hierarchical conceptual labels.

  • Maximal clique segmentation based model (MCS). In this method, we first construct a semantic graph that associates all entities in a BoW, where each vertex corresponds to an entity and the weight of an edge corresponds to the similarity between two entities. The similarity between two entities di and dj is defined by (11), where \(\mathcal {C}_{m}\) is replaced by {di, dj}. We then follow the idea of maximal clique segmentation [43, 45] to divide the semantic graph into several parts given a weight threshold, thus clustering the BoW into several clusters according to semantic similarity. Finally, we select one concept label for each cluster, which yields a flat conceptual label set. Moreover, by considering multiple (here we set k = 3) weight thresholds, we can cluster the BoW in different ways, thus generating multiple label sets with different levels for one BoW.
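The multi-threshold clustering step of MCS can be illustrated with a simplified sketch: edges below a weight threshold are dropped, and the remaining components form the clusters. Connected components are used here as a stand-in for the full maximal clique segmentation of [43, 45], and `sim` is a hypothetical stand-in for the similarity of (11):

```python
def threshold_clusters(words, sim, thresh):
    """Drop edges with weight < thresh, then return connected components.

    A simplification of maximal clique segmentation: connected components
    illustrate how raising the threshold splits the semantic graph.
    """
    # Build the thresholded adjacency structure.
    adj = {w: set() for w in words}
    for i, wi in enumerate(words):
        for wj in words[i + 1:]:
            if sim(wi, wj) >= thresh:
                adj[wi].add(wj)
                adj[wj].add(wi)
    # Collect connected components by depth-first search.
    seen, clusters = set(), []
    for w in words:
        if w in seen:
            continue
        stack, comp = [w], set()
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters
```

Calling `threshold_clusters` once per threshold (here k = 3 thresholds) yields the multiple label sets with different levels described above: a higher threshold breaks the graph into more, finer clusters.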

We apply our model as well as the two baselines to generate hierarchical conceptual labels for these BoWs, considering two cases: with and without running the denoising algorithm in advance.

4.3.3 Evaluation criteria

In our experiments, we take a manual scoring approach to evaluate the performance of generating hierarchical conceptual labels for real data, for two reasons: (1) there are no objective evaluation criteria for this task in previous work; (2) given a BoW in real applications, there is no single ground truth for its best hierarchical conceptual labels. In fact, a BoW may have several reasonable labeling results. In the toy example in Figure 7, both (a) and (b) are reasonable for the BoW {China, India, France, Germany}.

Figure 7

Two labeling results for the same BoW

We recruit v = 5 volunteers and ask them to evaluate the labeling results by scoring, where the scoring criteria are motivated by [44]. In general, good hierarchical conceptual labels for a BoW require that the labels in each level of the hierarchy exhibit both high specificity and high coverage. Specificity means that the labels should strongly conceptualize the BoW. For example, the label pop singer is more specific than celebrity for conceptualizing the sample BoW B = {Jay Chou, Andy Lau}, i.e., p(pop singer|B) > p(celebrity|B). High coverage means that the labels should conceptualize as many words in the BoW as possible. We divide the quality of labels in each level into the following four grades:

  • Perfect (3). If the labels in a level can well conceptualize the BoW, the score of this level is 3. For example, given the BoW {volleyball,basketball,football}, we easily think of ball game, which is an appropriate label with score 3.

  • Minor loss in specificity or coverage (2). Score 2 means that the labels in a level have a minor loss in specificity or coverage. For example, given the BoW {meal, dinner, wedding, breakfast, ceremony, food}, the label meal gets score 2: meal only conceptualizes meal, dinner, breakfast and food well, leaving out ceremony and wedding, and thus loses minor coverage. An appropriate label set for this BoW is {meal, wedding}, with score 3. For another example, given the BoW {France, UK, Germany, Italy}, the label country loses minor specificity, being less specific than European country. So country has score 2 and European country has score 3.

  • Much loss in specificity or coverage (1). If the label set loses much specificity or coverage, its score is 1. For example, given the BoW {poplar, pine, cherry, rose}, the label tree loses much coverage, since 50% of the words (cherry, rose) are not semantically covered by tree. A suitable label set is {tree, flower}, with score 3. For another example, given the BoW {puppy, kitten, piggy}, the label creature loses much specificity, because there are at least three more specific concepts in MCG, i.e., pet, mammal and animal. So creature scores only 1. Furthermore, we consider that pet scores 3 and mammal (or animal) scores 2 for this BoW.

  • Misleading or unrelated (0). Sometimes, the generated labels may be unrelated to the BoW or misleading, thus getting score 0. For example, the generated label improvement cannot conceptualize the BoW {walkway, swimming pool, vehicle}.

4.3.4 Metric

The evaluation of hierarchical conceptual labels is conducted by averaging the scores over all levels in the hierarchy. For each data subset, let Lj be the number of levels in the hierarchy of the j-th (j = 1, ..., 300) BoW, and let si,j,k (k = 1, ..., Lj) be the i-th (i = 1, 2, ..., 5) volunteer’s score for the k-th level of the hierarchy of the j-th BoW. The average score is computed as:

$$ \text{Average Score} =\frac{1}{bv}\sum\limits_{i=1}^{v}\sum\limits_{j=1}^{b}\left( \frac{1}{L_{j}}\sum\limits_{k=1}^{L_{j}}{s_{i,j,k}}\right) $$
(20)

where b is the number of BoWs in each subset and v is the number of volunteers.
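Equation (20) first averages each volunteer's scores over the levels of a hierarchy, then averages over BoWs and volunteers. A minimal sketch:

```python
def average_score(scores):
    """Average score of Eq. (20).

    scores[i][j] is volunteer i's list of per-level scores s_{i,j,k}
    for the j-th BoW (length L_j).
    """
    v = len(scores)     # number of volunteers
    b = len(scores[0])  # number of BoWs per subset
    # Inner mean over the L_j levels, outer mean over volunteers and BoWs.
    return sum(sum(per_level) / len(per_level)
               for volunteer in scores for per_level in volunteer) / (v * b)
```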

4.3.5 Results

The results are presented in Table 3. We conclude that: (1) for all models, the scores with the denoising algorithm are higher than those without it, which further proves the necessity of denoising in the conceptual labeling task, a step ignored by most previous work; (2) the proposed model outperforms the two baselines. In particular, BHC only produces binary branching structures and cannot generate the multi-branching structures that frequently appear when clustering BoWs (see Figure 1), while MCS only clusters BoWs into multi-level label sets without the hierarchical structure that helps generate hierarchical conceptual labels; (3) the performance on Wikipedia data is better than that on Flickr data, because each Wikipedia text tends to express a small number of topics, whereas the semantics of a picture is usually more dispersed, which increases the difficulty of conceptualization.

Table 3 Average scores on Flickr and Wikipedia data (where each score is divided by 5)

4.4 Discussion of thresholds

Noise threshold δ

A large δ may cause non-noise words to be filtered out, while a δ that is too small leaves some noise words unfiltered. To select an appropriate δ, we first construct 1000 synthetic BoWs with random parameter settings (requiring noise proportions below 40%), and then calculate the accuracy according to (16) under different settings of δ. The results are shown in Table 4, indicating that δ = 5 × 10− 8 is a good choice, where the magnitude “10− 8” is caused by the small p(c) of a concept c.
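The filtering step itself reduces to a threshold test. In the sketch below, `score` is a hypothetical stand-in for the paper's criterion in (16), which is not reproduced here:

```python
def filter_noise(words, score, delta=5e-8):
    """Keep the words whose score reaches the noise threshold delta.

    score(w) abstracts the scoring criterion of Eq. (16); words
    scoring below delta are treated as noise and removed.
    """
    return [w for w in words if score(w) >= delta]
```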

Table 4 The accuracy under different thresholds

Likelihood ratio threshold γ

To balance the depth of the hierarchy against specificity, we need to select γ appropriately. Specifically, we randomly select 200 samples from the above 1000 BoWs and filter out the noise in advance. We then set γ = 0.6, 0.8, ..., 2.0 and use our method to generate labels for these BoWs, respectively. Finally, we manually evaluate the labels according to their depth and specificity, and conclude that γ = 0.8 is an appropriate setting.

4.5 Case study

We present some samplesFootnote 5 generated by our model in Figure 8. In general, each sample is consistent with our intuition, i.e., the generated labels conceptualize the corresponding words well. Moreover, our model also generates labels that are more specific than the concepts people typically think of. For example, most people think of the concept color when given the words {pink, yellow}; in addition to color, our model also captures the fine-grained concept bright color. The generated concepts with different granularities can be used in many downstream AI tasks.

Figure 8

Three samples generated by our model

5 Related work

5.1 Conceptualization

Conceptualization is a very important task for natural language understanding (NLU), and it maps a text or a BoW to several concepts that are pre-defined in a certain taxonomy or knowledge base [8, 13, 22, 23, 48].

When considering a single entity, Wu et al. [51] constructed a large taxonomy in which each entity has abundant concepts, laying the foundation for downstream applications. Wang et al. [10] proposed a Bayesian model using typicality and PMI to label an entity with a basic-level concept. When considering a BoW or a text, Hua et al. [22] leveraged a co-occurrence network for concept inference. Song et al. [41] used a Bayesian model together with clustering to generate multiple labels for a short text. Sun et al. [44] used the minimum description length (MDL) principle to generate a set of conceptual labels for a bag of words. These solutions aim at generating flat conceptual labels for short texts, entities or unweighted bags of words. Conceptualization can also be used in other applications, e.g., image understanding [14, 28], identifying user intention [21], computing term similarity [26], etc. However, no existing work focuses on HCL for BoWs.

5.2 Conceptualization in topic modeling

Conceptualization is also widely combined with the task of topic modeling, which aims at generating conceptual labels to explain the topics represented by a distribution over words.

Early efforts relied on humans to find meaningful labels [30, 42]. However, manual labeling requires great human effort and is prone to subjectivity [46]. To alleviate this, probabilistic approaches were proposed to interpret multinomial topic models automatically and objectively [31]. These approaches achieved automatic interpretation of topics, but the available candidate labels were limited to phrases inside documents. To overcome this limitation, Lau et al. [24] proposed an automatic topic label generation method that obtains candidate labels from Wikipedia articles containing the top-ranking topic terms, top-ranked document titles, and sub-phrases.

The conceptualization above was conducted without supervision. To improve the labeling accuracy, supervised labeling was proposed, such as supervised latent Dirichlet allocation (sLDA) [7] and labeled LDA (LLDA) [37].

6 Conclusion and future work

Explicit semantics is very important for machines to better understand natural language. This paper proposes the task of HCL, which aims to generate conceptual labels of different granularities for BoWs. To achieve this, we propose a BRT-based approach, and the experiments show its high performance. Since a BoW usually contains noise, we also propose a denoising algorithm, which effectively filters out the noise in advance and helps the labeling approach better represent the explicit semantics of a BoW.

Some topics are worth further study. (1) How can the hierarchical labels of a BoW be fully exploited in downstream tasks? Many tasks involve semantic understanding, such as text summarization, reading comprehension, and recommendation systems; the results of conceptualization could be combined into current models to improve their performance. (2) In many scenarios, a text can be modeled as a bag of weighted words (e.g., by a probabilistic topic model, TF-IDF, etc.). How to conceptualize a bag of weighted words is another interesting problem.