
1 Introduction

Facet mining is the process of exploring and discovering choices within an information collection; it summarizes the knowledge contained in a knowledge base using high-quality structured data, large databases of text-annotated objects, and semantically related feature sets such as WordNet for a query [1,2,3]. Query reformulation and query recommendation are methods widely used to help users express their intention. Faceted search provides direct, instant answer groups, e.g., cell phones grouped by display size, color, or operating system. For vague, ambiguous queries, which are typically short and noisy, facets are useful because they present a clarified list of choices matching the user's interest. Like normal and approximate query results, facets are abundant in nature; to avoid a facet explosion, the facets can be ranked so that only the top-k diversified links of choice are displayed [4,5,6,7]. Table 1 shows the facet lists for the item "watch" in various categories. Facet lists for "watch" are crawled for four categories, namely watch category, watch brand, watch color, and watch type. Query facets can be displayed along with the top search results; since the facets reveal the different aspects of a query, the user obtains an additional set of results related to the given query, which can also be used in an approximate search process [8,9,10,11].

The quality of a facet can be measured in terms of specificity and dispersion. Specificity is the degree to which a facet belongs or relates uniquely to a particular subject. Dispersion is the degree to which a facet is distributed, i.e., how often it appears on multiple lists.

Table 1 Query facets for query: “Watches”

The proposed dynamic facet mining process, applied to the data in Table 1, generates highly ranked faceted items over an online database through a systematic metadata-based ranking process, as shown in Fig. 1.

Fig. 1 Dynamic facet mining process

2 Literature Review

Takahashi et al. [12] proposed a new user-behavioral approach, link anomaly-based detection, to detect the emergence of topics in a social network stream. The basic idea of this approach is to focus on the social aspect of the posts, reflected in the mentioning behavior of users, instead of the textual contents. Yan and Wan [13] used a heterogeneous ranking algorithm called SRRank. A heterogeneous graph is constructed with sentences, semantic roles, and words as nodes. The SRRank algorithm ranks the nodes, and the highest-ranked nodes are taken for generating summaries.

Traditional query suggestion approaches such as annotations, query logs, and online summarization methods are used to design database schemas and to generate metadata attributes [8, 14,15,16].

Dou et al. [5] proposed QDMiner, a framework for mining query facets from large numbers of lists. QDMiner extracts lists from the top search results, groups them into clusters based on their features, and ranks the clusters according to how they appear in the search results.

Vandic et al. [4] proposed a framework for dynamic facet mining based on the specificity and dispersion of facet values. Their facet optimization algorithm ranks properties based on their importance and sorts the values within each property.

3 Methodology

The proposed facet mining process has several phases: facet extraction, facet weighing, facet list clustering, facet ranking, and item ranking. Facets are used to prepare multiple groups of semantically related queries [3].

3.1 Problem Statement

For a document \(\mathrm{Doc}_{i}\) annotated with metadata \({M}_{k}\) and a query \({Q}_{i}\), produce facets for each \(\mathrm{Doc}_{i}\) with weight \(\mathrm{Doc}_\mathrm{fi}\); these facets are clustered and grouped to pick out the top-k search items with ranks \(\mathrm{WRank}_\mathrm{ei}\), where fi is the weight of each facet associated with that document and ei is the rank of an item e.
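The following minimal Python sketch captures the inputs and outputs implied by this problem statement: documents annotated with metadata keywords, per-document facet weights, and a ranked top-k item list. The class and function names are illustrative, not part of the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Document:
    """A document Doc_i annotated with metadata keywords M_k."""
    doc_id: str
    metadata_keywords: List[str]
    # facet name -> weight Doc_fi computed during the facet weighing phase
    facet_weights: Dict[str, float] = field(default_factory=dict)

@dataclass
class RankedItem:
    """An item e together with its final rank value WRank_ei."""
    item: str
    wrank: float

def top_k(items: List[RankedItem], k: int) -> List[RankedItem]:
    """Return the k items with the highest WRank values."""
    return sorted(items, key=lambda r: r.wrank, reverse=True)[:k]
```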

3.2 Facet Extraction

The FMAD performs facet extraction on the metadata of documents [6, 17, 18]. It is also able to extract the required information from web forms by applying tag-level extraction. The metadata of a document contains a great deal of information about its contents and properties. Meta properties such as title, type, url, image, description, site name, sign-in client-id, sign-in cookie policy, and sign-in scope are useful for document categorization and data security [8, 19]. Table 2 lists the results of metadata extraction for the query "watches" over the ClueWeb dataset; a minimal extraction sketch is given after Table 2. After list extraction, the obtained results are organized as facets such as category, brand, color, and type.

Table 2 Examples for data extraction from web page
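As an illustration of tag-level metadata extraction, the sketch below pulls the meta properties named above (title, type, url, image, description, site name) from a web page using requests and BeautifulSoup. This is only a minimal sketch under the assumption that the pages expose standard Open Graph `<meta>` tags; it is not the FMAD implementation, and the example URL is hypothetical.

```python
# Minimal sketch of tag-level metadata extraction (illustrative only).
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

META_PROPERTIES = ["og:title", "og:type", "og:url", "og:image",
                   "og:description", "og:site_name"]

def extract_metadata(url: str) -> dict:
    """Return the Open Graph metadata found in the page's <meta> tags."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    metadata = {}
    for prop in META_PROPERTIES:
        tag = soup.find("meta", attrs={"property": prop})
        if tag and tag.get("content"):
            metadata[prop] = tag["content"]
    return metadata

# Example (hypothetical URL):
# print(extract_metadata("https://example.com/watches"))
```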

3.3 Facet Weighing

The weight of a facet in a document is calculated from the facet count in the document and the number of keywords shared between the facet keyword lists and the document. Let \({M}_{k}\) be the metadata keyword list for a facet. If a document contains many occurrences of a particular facet, the document shows a high term frequency for that facet; hence, the facet count is valuable for calculating the facet weight (\(\mathrm{Facet}_{w}\)). In order to balance the effect of common items in a document, the inverse document frequency of that document, represented as \(\mathrm{idf}_{e}\), is considered during the weight calculation; the full equation is given in (1).

$$\begin{aligned} \mathrm{Facet}_{{w}} =\mathrm{Doc}_{{w}}*\mathrm{Facet}_{\text {count}}*\frac{\mathrm{idf}_{{e}}}{|l|} \end{aligned}$$
(1)

where

$$\begin{aligned} \begin{aligned} \mathrm{Doc}_{{w}}=\frac{N_{\text {ld}}}{|l|} \end{aligned} \end{aligned}$$
(2)

where \(N_\mathrm{ld} \le M_{k}\) is the number of keywords shared between the facet keyword list and the document, and l is the total number of lists identified, as given in (2). \(\mathrm{idf}_{e}\) is calculated as in (3).

$$\begin{aligned} \begin{aligned} \mathrm{idf}_{{e}}=\log \frac{N-N_{{e}}+0.5}{N_{{e}}+0.5} \end{aligned} \end{aligned}$$
(3)

where N is the total number of words in the given document and \(N_e\) is the number of terms which have a direct or semantic relation to the given item e. After facet weighing, the normalized weight list obtained is (Category: 1.0, Brand: 0.40, Color: 1.0, Type: 0.2).
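A minimal sketch of the facet weighing step, following (1)–(3), is given below. The function and argument names are illustrative, and the shared-keyword, list, and word counts are assumed to come from the extraction phase.

```python
import math

def idf_e(total_words: int, related_terms: int) -> float:
    """Inverse document frequency of item e, as in Eq. (3)."""
    return math.log((total_words - related_terms + 0.5) / (related_terms + 0.5))

def doc_weight(shared_keywords: int, num_lists: int) -> float:
    """Document weight Doc_w, as in Eq. (2): shared keywords over lists found."""
    return shared_keywords / num_lists

def facet_weight(shared_keywords: int, num_lists: int, facet_count: int,
                 total_words: int, related_terms: int) -> float:
    """Facet weight Facet_w, as in Eq. (1)."""
    return (doc_weight(shared_keywords, num_lists)
            * facet_count
            * idf_e(total_words, related_terms) / num_lists)

# Hypothetical numbers for the query "watches":
# print(facet_weight(shared_keywords=3, num_lists=4,
#                    facet_count=5, total_words=1200, related_terms=40))
```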

3.4 Facet List Clustering

During facet list clustering, lists with similar intentions are grouped together. If two facet lists are related, the distance between them can be calculated as in (4). Quality threshold clustering with weighted data points and web page count (WQTWC) is used to reduce the overlap of clusters.

$$\begin{aligned} \begin{aligned} {d}(f_{{l_1}},f_{{l_2}})=1-\frac{|f_{{l_1}}\cap f_{{l_2}}|}{\min (|f_{{l_1}}|,|f_{{l_2}}|)} \end{aligned} \end{aligned}$$
(4)

The pairwise distances obtained between the facet lists are \({d}(f_{l_1}, f_{l_2})=0.2\), \({d}(f_{l_1}, f_{l_3}) = 1\), \({d}(f_{l_1},f_{l_4})=0.94\), \({d}(f_{l_2},f_{l_3})= 1\), \({d}(f_{l_2}, f_{l_4})= 0.91\), and \({d}(f_{l_3}, f_{l_4})= 1\). The WQTWC algorithm uses a cluster diameter of the user's choice; here it is 0.96. All points within a threshold distance of 0.96 are locked into the initial cluster, as given in Algorithm 1. Algorithm 1 chooses the minimum weighted list \(W_{\min }\) from the faceted lists. It then chooses a threshold diameter \(\mathrm{Diameter}_{\max }\) and randomly selects a target facet list with weight \(\mathrm{Target}_\mathrm{weight}\). The pairwise distances among the lists are calculated and a core cluster is formed to initialize the clustering, with the target list of weight \(\mathrm{Target}_\mathrm{weight}\) set as the locus of the core cluster. All faceted lists within a distance of \(\mathrm{Diameter}_{\max }\) are then added to the core cluster, and the remaining faceted lists are clustered similarly. A lookup table is maintained for the faceted lists to obtain their website counts, and a re-clustering operation based on the website count produces the final clusters. The algorithm produced three clusters, namely \(C_1\), \(C_2\), and \(C_3\), shown in Fig. 2: the faceted lists \(f_{l_1}\) and \(f_{l_2}\) are grouped in cluster \(C_1\), faceted list \(f_{l_3}\) is grouped in cluster \(C_2\), and faceted list \(f_{l_4}\) is grouped in cluster \(C_3\). The typical WQT algorithm, which works only on the weighted lists, would have grouped \(f_{l_4}\) in cluster \(C_1\). The normalized website counts for the given faceted lists are (Category: 1, Brand: 1, Color: 0.6, Type: 0.01). The website count for \(f_{l_4}\) is comparatively low, which raised its threshold distance above 0.96, so it is grouped in \(C_3\).

Algorithm 1 WQTWC facet list clustering
Fig. 2 Facet clustering with WQTWC algorithm
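The sketch below is one possible reading of the WQTWC step described above: quality-threshold clustering with the overlap distance of (4), followed by a re-clustering pass that splits off lists with a low normalized website count. It is illustrative only; \(W_{\min }\) is treated here as a minimum-weight threshold, the seed is taken as the highest-weighted remaining list rather than a random target, and the 0.96 diameter and 0.1 website-count cutoff are assumptions based on the worked example.

```python
from typing import Dict, List, Set

def list_distance(a: Set[str], b: Set[str]) -> float:
    """Overlap distance between two facet lists, as in Eq. (4)."""
    return 1.0 - len(a & b) / min(len(a), len(b))

def wqtwc_cluster(facet_lists: Dict[str, Set[str]],
                  weights: Dict[str, float],
                  site_counts: Dict[str, float],
                  w_min: float = 0.0,
                  max_diameter: float = 0.96,
                  min_site_count: float = 0.1) -> List[List[str]]:
    """Quality-threshold clustering of facet lists, then re-clustering by website count."""
    # Only lists whose weight meets the minimum W_min take part in clustering.
    remaining = {name for name in facet_lists if weights[name] >= w_min}
    clusters: List[List[str]] = []
    while remaining:
        # Seed the core cluster with the highest-weighted remaining list (its locus).
        seed = max(remaining, key=lambda name: weights[name])
        cluster = [seed]
        for name in sorted(remaining - {seed}):
            if list_distance(facet_lists[seed], facet_lists[name]) <= max_diameter:
                cluster.append(name)
        clusters.append(cluster)
        remaining -= set(cluster)
    # Re-clustering: lists with a very low normalized website count form their own clusters.
    final: List[List[str]] = []
    for cluster in clusters:
        keep = [n for n in cluster if site_counts[n] >= min_site_count]
        drop = [n for n in cluster if site_counts[n] < min_site_count]
        if keep:
            final.append(keep)
        final.extend([n] for n in drop)
    return final
```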

Fig. 3 During facet ranking, related groups are clustered

3.5 Facet Ranking

Facet ranking is the process of grouping the lists in each cluster based on their hamming distance. The weight of a group is the number of lists in that group, where a group is defined as the lists in a cluster which are closely similar. A list grouping approach is used for this purpose: a minimum weight \(W_{\min }=1\) is set, and a threshold value FLdup on the hamming-distance similarity is used to form subgroups within selected clusters. The cluster with the highest-weighted lists is identified, and its highest-weighted list is made the locus of a subgroup within the group. The other lists whose hamming distance to the locus is within FLdup are added to the new subgroup. The same process is applied to the other n top-most clusters. The hamming distance between the watch category list \(f_{l_1}\) and the watch brand list \(f_{l_2}\) is calculated as in (5), and the subgroups formed in cluster \(C_1\) are shown in Fig. 3.

$$\begin{aligned} \begin{aligned} {\text {HD}}_{{f}_{l_1},{f}_{l_2}} =1-\frac{{\text {hd}}(d({f}_{l_1}, {f}_{l_2}))}{\text {LFP}} \end{aligned} \end{aligned}$$
(5)

where LFP is the length of the fingerprint used (64 by default), and \(\mathrm{HD}_{f_{l_1}, f_{l_2}}\) is the normalized hamming-distance measure between the two faceted lists (\(f_{l_1}\), \(f_{l_2}\)).
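A minimal sketch of the fingerprint-based measure in (5) follows, assuming 64-bit simhash-style fingerprints of the facet lists; the hashing choice (MD5-based simhash) and the example lists are assumptions for illustration.

```python
import hashlib
from typing import List

LFP = 64  # fingerprint length in bits, the default used in Eq. (5)

def fingerprint(items: List[str]) -> int:
    """Build a 64-bit simhash-style fingerprint of a facet list (illustrative)."""
    counts = [0] * LFP
    for item in items:
        h = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")
        for bit in range(LFP):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(LFP) if counts[bit] > 0)

def hd_measure(fp1: int, fp2: int) -> float:
    """Normalized hamming-distance measure HD, as in Eq. (5)."""
    hamming = bin(fp1 ^ fp2).count("1")
    return 1.0 - hamming / LFP

# Hypothetical facet lists:
# watch_category = ["analog", "digital", "smart"]
# watch_brand = ["titan", "fossil", "casio"]
# print(hd_measure(fingerprint(watch_category), fingerprint(watch_brand)))
```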

3.6 Item Ranking

The importance of an item e in a list depends on its rank in the list and the number of lists containing it. The rank of a faceted item in a group depends on the item's count in the lists of that facet. The weight of an item in a faceted list \(f_{l_i}\) is calculated as in (6).

$$\begin{aligned} \begin{aligned} {\text {WRank}}_{\text {ek}} =\sum \limits _{{f}_{l_{i}}\in {C}}\frac{1}{\sqrt{\text {AvgRank}}} \end{aligned} \end{aligned}$$
(6)

where average facet rank AvgRank is calculated as

$$\begin{aligned} \begin{aligned} {\text {AvgRank}} =\frac{1}{|{\text {SFL}}|}\sum \limits _{{l}\in \text {SFL}}{\text {Rank}}_{{e}} \end{aligned} \end{aligned}$$
(7)

where SFL is the set of all lists in the facet and the faceted group that contain the item e, and \(\mathrm{Rank}_{e}\) is the rank of item e within a list.
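A minimal sketch of the item ranking step in (6)–(7): AvgRank averages the rank of item e over the lists in SFL that contain it, and WRank accumulates one \(1/\sqrt{\mathrm{AvgRank}}\) contribution per containing list in the cluster. Reading (6) as summing only over lists that contain e is an assumption, and the names are illustrative.

```python
import math
from typing import List

def avg_rank(item: str, facet_lists: List[List[str]]) -> float:
    """AvgRank of item e over the lists (SFL) that contain it, as in Eq. (7)."""
    ranks = [fl.index(item) + 1 for fl in facet_lists if item in fl]
    return sum(ranks) / len(ranks)

def wrank(item: str, cluster_lists: List[List[str]]) -> float:
    """WRank of item e, as in Eq. (6): one 1/sqrt(AvgRank) term per containing list."""
    containing = [fl for fl in cluster_lists if item in fl]
    if not containing:
        return 0.0
    a = avg_rank(item, containing)
    return sum(1.0 / math.sqrt(a) for _ in containing)

# Hypothetical facet lists within one cluster:
# cluster = [["titan", "fossil", "casio"], ["fossil", "titan"], ["casio"]]
# print(wrank("titan", cluster))
```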

4 Performance Evaluation

4.1 Effect of List Weighing Method

A list which contains a term frequently would show a high document weight for it, but at the same time some valid facets may be degraded because they appear less frequently. Including \(\mathrm{idf}_{e}\) in the calculation solves this problem. The document weight used in (1), defined in (2), measures facets that are spread over various lists and improves the contribution of the facet count in the lists. Figure 4 shows the effect of list weighting in facet mining.

Fig. 4 Facet weight with various measurement methods

Fig. 5 Effect of search result quantity during facet mining

4.2 Effect of Search Result Quantity

When the number of search results is increased, the facet groups are enriched with more lists; these lists carry more facets, and hence the overall quality of the facets improves, as shown in Fig. 5. From Sect. 3.3, it is clear that the facet count plays an important role in the facet list weighing process, so when there are more results, their facet counts are also likely to be higher. Specific properties whose facets match many products have high impurity.

5 Conclusion

The proposed facet mining approach follows a series of phases, such as facet clustering and grouping, in order to organize items that the user can drill down into with a specific intention on a navigational search platform. Because facet results are abundant, FMAD follows a facet weighing and clustering scenario. Quality clusters are derived by considering the disjoint facet count. The hamming distance is measured for cluster grouping by considering each facet list in a cluster. The average facet rank is calculated to list the top-k items which are specific to the query and dispersed across the query dimensions.