1 Introduction

Since a web page is characterized by several dimensions (i.e. textual, structural based on HTML tags and visual rendering, and structural based on hyperlinks), existing clustering algorithms differ in their ability to use these representations. In particular, algorithms based on the textual representation typically group web pages using their word distribution [6, 13]. These solutions treat web pages as plain text, ignore all the other information with which a page is enriched, and turn out to be ineffective for at least two categories of web pages: (i) pages that do not contain enough textual information; (ii) pages that have different content but refer to the same semantic class. The former case covers web pages with poor textual information, such as pages rich in structural data (e.g. from Deep Web databases) or multimedia data, or containing scripts that can easily be found in other pages as well (e.g. from a CMS website). The latter case covers pages having the same semantic type (e.g. web pages related to professors, courses, publications) but characterized by a different distribution of terms.

On the other side, clustering based on structure typically considers the HTML formatting (i.e. HTML tags and the visual information rendered by a web browser) [2, 3, 7]. Algorithms that use this information rely on the idea that web pages are automatically generated by programs that extract data from a back-end database and embed them into an HTML template. Such pages share a common structure and layout but differ in content. However, because tags are used for displaying content, most of the web pages in a website often have the same structure even when they refer to distinct semantic types, which negatively affects cluster quality.

The solutions described above exploit within-page information. Other algorithms make use of the graph defined by the hyperlink structure of a set of web pages [18, 22]. Hyperlinks can be used to identify collections of semantically related web pages and the relationships among these collections. In this area, DeepWalk and Line [18, 22] are two embedding-based methods that exploit neural networks to generate a low-dimensional and dense vector representation of the graph's nodes. DeepWalk [18] applies the skip-gram method on truncated random walks to encode long-range influences among the graph's nodes. Still, this approach is not able to capture the local graph structure (i.e. nodes which can be considered similar because they are strongly connected). Line [22] optimizes an objective function that incorporates both direct neighbours and neighbours of neighbours. However, both methods (DeepWalk and Line) ignore node attributes (e.g. textual content).

Most of the discussed works analyze content, web page structure (i.e. HTML formatting), and hyperlink structure almost independently. Over the last decade, some researchers have tried to combine several sources of information: for example, [1, 16] combine content and hyperlink structure for web page clustering, while [4, 7, 19] combine web page structure and hyperlink structure for the same purpose. This paper is a contribution in this direction: it combines information about the content, the web page structure, and the hyperlink structure of web pages in a homogeneous way. It analyzes the HTML formatting of web pages to extract from each page collections of links, called web lists, which are used to generate a compact and noise-free representation of the website's graph. Then, the extracted hyperlink structure and the content information of web pages are mapped into a single vector space representation that can be used by clustering algorithms. Our approach is based on the idea that two web pages are similar if they have common terms (i.e. the bag-of-words hypothesis [23]) and if they share the same reachability properties in the website's graph. In order to consider reachability, the proposed solution is inspired by the Distributional Hypothesis, initially defined for words in natural language processing ("You shall know a word by the company it keeps" [9]) and recently extended to generic objects [11]. In the context of the Web, we can rephrase that statement as "You shall know a web page by the paths it keeps" (i.e. two similar web pages are involved in the same paths).

2 Methodology

The proposed solution implements a four-step strategy. In the first step, website crawling is performed: crawling uses the structural information of web pages and exploits web lists in order to mitigate the problems coming from noisy links. The output of this phase is the website graph, where each node represents a single page and edges represent hyperlinks. In the second step, we generate a link vector for each page by exploiting random walks extracted from the website's graph. In the third step, content vectors are generated. In the last step, a unified representation of the pages is generated and clustering is performed on this representation.

2.1 Website Crawling

A website can be formally described as a directed graph \(G = (V,E) \), where V is the set of web pages and E is the set of hyperlinks. In most cases, the homepage h of a website represents the website's entry page and allows the website to be viewed as a rooted directed graph. As claimed in [7], not all links are equally important for describing the website structure. In fact, a website is rich in noisy links that may not be relevant to the clustering process, such as hyperlinks used to enforce the web page authority, short-cut hyperlinks, and so on. Besides, the website structure is codified in navigational systems, which provide a local view of the website organization. Navigational systems (e.g. menus, navbars, product lists) are implemented as collections of hyperlinks that share the same domain name and have common layout and presentation properties. Our solution is based on the use of web lists. This has a twofold effect: on one side, it guarantees that only URLs useful to the clustering process are considered; on the other side, it allows the method to implicitly take into account the web page structure, which is codified in the web lists available in web pages [7, 15, 24]. Starting from the homepage h, a crawling algorithm iteratively extracts the collection of URLs that have the same domain as h and are organized in web lists. Only web pages included in web lists are further explored. Following [15], a web list is:

Definition 1

A Web List is a collection of two or more web elements having similar HTML structure, visually adjacent and aligned on a rendered web page. The alignment is identified on the basis of the x-axis (vertical list), the y-axis (horizontal list), or in a tiled manner (aligned vertically and horizontally).
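The alignment criterion of Definition 1 can be illustrated with the following minimal sketch, which assumes that each candidate web element is described by its tag path and by the position of its rendered bounding box (field names are assumptions made for this illustration). In our pipeline the actual extraction is delegated to HyLien [10]; this is not the authors' implementation.

```python
# Illustrative check of Definition 1 on pre-rendered elements.
# Each element is a dict such as {"tag_path": "div/ul/li/a", "x": 10, "y": 120}.
from collections import defaultdict


def group_aligned_elements(elements, tol=5):
    """Group structurally similar elements aligned on the x-axis (vertical list)
    or on the y-axis (horizontal list). Tiled alignment is omitted for brevity."""
    by_structure = defaultdict(list)
    for el in elements:
        by_structure[el["tag_path"]].append(el)

    web_lists = []
    for els in by_structure.values():
        if len(els) < 2:
            continue  # a web list needs at least two elements
        xs = {round(el["x"] / tol) for el in els}
        ys = {round(el["y"] / tol) for el in els}
        if len(xs) == 1 or len(ys) == 1:  # vertically or horizontally aligned
            web_lists.append(els)
    return web_lists
```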

Fig. 1. Web lists extracted from a web page taken from www.cs.illinois.edu (Color figure online)

Figure 1 shows, in red boxes, the web lists extracted from the homepage of a computer science department, which will be used for website crawling. The links in box A are excluded because their domains differ from the homepage's domain. To identify the set of web lists in a web page we use HyLien [10]. The output of the website crawling step is the sub-graph \(G' = (V', E')\), with \(V'\subseteq V\) and \(E' \subseteq E\), which is used in the link and content vector generation steps.
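A minimal sketch of the web-list driven crawling loop is given below. It is not the authors' implementation: web-list detection (HyLien in our pipeline) is approximated by a crude structural heuristic that groups anchors sharing the same parent element, and the library choices (requests, BeautifulSoup) are assumptions.

```python
# Sketch of web-list based crawling: only same-domain links found inside
# (approximated) web lists are followed.
from collections import deque, defaultdict
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def approximate_web_lists(html):
    """Group <a> tags by parent element; keep groups with at least two links."""
    soup = BeautifulSoup(html, "html.parser")
    groups = defaultdict(list)
    for a in soup.find_all("a", href=True):
        groups[id(a.parent)].append(a["href"])
    return [hrefs for hrefs in groups.values() if len(hrefs) >= 2]


def crawl_with_web_lists(homepage, max_pages=1000):
    domain = urlparse(homepage).netloc
    visited, edges = set(), set()
    queue = deque([homepage])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for web_list in approximate_web_lists(html):
            for href in web_list:
                target = urljoin(url, href)
                if urlparse(target).netloc != domain:
                    continue  # off-domain links (e.g. box A in Fig. 1) are discarded
                edges.add((url, target))
                if target not in visited:
                    queue.append(target)
    return visited, edges  # V' and E' of the sub-graph G'
```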

2.2 Link Vectors Generation Through Random Walks

A random walk over a linked structure is based on the idea that the connections among nodes encode information about their correlations. To codify these correlations we use the Random Walk with Restart (RWR) approach. RWR is a Markov chain describing the sequence of nodes (web pages) visited by a random walker: starting from a random node i, with probability \(1- \alpha \) the walker moves to a connected neighbour node or, with probability \(\alpha \), it restarts from i.
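As an illustration, a minimal sketch of RWR-based walk generation is given below; the function name and its parameters (number of walks per node, walk length, \(\alpha\)) are assumptions for this sketch, not the authors' exact procedure.

```python
# Minimal sketch of Random Walk with Restart generation over the website graph.
import random


def random_walks_with_restart(adjacency, walks_per_node=10, walk_length=10, alpha=0.15):
    """adjacency: dict mapping each page to the list of pages it links to."""
    walks = []
    for _ in range(walks_per_node):
        for start in adjacency:
            walk, current = [start], start
            for _ in range(walk_length - 1):
                neighbours = adjacency.get(current, [])
                if random.random() < alpha or not neighbours:
                    current = start  # restart from the seed page with probability alpha
                else:
                    current = random.choice(neighbours)  # move to a random neighbour
                walk.append(current)
            walks.append(walk)
    return walks
```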

Inspired by the field of information retrieval, we model a web page as a word, that is, a topic indicator, and each random walk as a document constituting the natural context of words (i.e. a topical unit). Thus, we represent a collection of random walks as a document collection in which topics intertwine and overlap. This enables the application of a distributional-based algorithm to extract new knowledge [21] from the obtained representation. In our case, we apply the skip-gram model [17], a state-of-the-art algorithm, to extract a vector space representation of web pages that encodes the topological structure of the website. In the skip-gram model we are given a word w in a corpus of words \(V_W\) (in our case a web page w belonging to the random walks) and its context \(c \in V_C\) (in our case the web pages which appear before and after w in the random walks). We consider the conditional probabilities p(c|w) and, given a collection of random walks Rws, the goal is to set the parameters \(\theta \) of \(p(c|w; \theta )\) so as to maximize the probability:

$$\begin{aligned} \mathop {{argmax}}\limits _{\theta } \prod _{L \in Rws; w \in L}\left[ \prod _{c \in C_L(w)} prox_L(w,c)\cdot p(c|w;\theta ) \right] \end{aligned}$$
(1)

where L is a random walk in Rws, w is a web page in L and \(C_L(w) = \{w_{i-k}, \dots , w_{i-1}, w_{i+1}, \dots , w_{i+k}\}\) is the set of contexts of the web page w in L. Moreover, \(prox_L(w,c)\) represents the proximity between w and \(c \in C_L(w)\). This is necessary since the skip-gram model gives more importance to nearby context words than to distant ones. One approach for parameterizing the skip-gram model follows the neural-network language modelling literature and models the conditional probability \(p(c|w;\theta )\) using the soft-max: \( p(c|w;\theta ) = \frac{e^{v_c \cdot v_w}}{\sum _{c' \in V_C} e^{v_{c'} \cdot v_w} } \), where \(v_c\), \(v_{c'}\) and \(v_w \in \mathbb {R}^d\) are the vector space representations of \(c, c'\) and w respectively (d is defined by the user). Therefore, the optimization problem (1) leads to the identification of the web page and context matrices \(W = \{v_{w_i}| w_i \in V_W \}\) and \(C= \{v_{c_i}| c_i \in V_C \}\). They depend on each other, and we only use W to represent web pages (coherently with what is proposed in [17] for words). The computation of \(p(c|w;\theta )\) is expensive due to the summation \(\sum _{c' \in V_C}\); for this reason, [17] introduces the hierarchical softmax and negative-sampling approaches to make the computation more tractable. Therefore, given as input a corpus composed of the collection of random walks, the skip-gram model returns the matrix W, which embeds each web page into a dense and low-dimensional space \(\mathbb {R}^d\).
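In practice, link vectors can be obtained with an off-the-shelf SGNS implementation; the sketch below uses gensim's Word2Vec (version 4 API) as an assumption and does not reproduce the explicit proximity term \(prox_L(w,c)\) of Eq. (1), whose role is only approximated by the dynamic context window of standard skip-gram.

```python
# Sketch of link-vector generation: random walks are treated as sentences of
# page identifiers and fed to skip-gram with negative sampling (SGNS).
from gensim.models import Word2Vec


def link_vectors(walks, d=120, window=5):
    model = Word2Vec(
        sentences=walks,   # each walk is a list of page URLs
        vector_size=d,     # dimensionality d of the embedding space
        window=window,     # context size around each page
        sg=1,              # skip-gram (rather than CBOW)
        negative=5,        # negative sampling
        hs=0,
        min_count=1,
        epochs=5,
    )
    return {page: model.wv[page] for page in model.wv.index_to_key}
```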

2.3 Content Vectors Generation

Here we describe the process for generating a vector representation of web pages using textual information. Unlike traditional documents, web pages are written in HTML and contain additional information, such as tags, hyperlinks and anchor text. To apply a bag-of-words representation to web pages we need a preprocessing step, in which the following operations are performed: HTML tag removal (we however keep the terms in anchors, titles and metadata, since they contribute to better organizing web pages [8]); unescaping of escaped characters; elimination of non-alphanumeric characters; elimination of too frequent (\({>}90\%\)) and infrequent (\({<}5\%\)) words. After preprocessing, each web page is converted into a plain textual document and we can apply the traditional TF-IDF weighting scheme to obtain a content-vector representation. Due to the uncontrolled and heterogeneous nature of web page contents, the content-based vector representation of web pages is high-dimensional and sparse. To obtain a dense and low-dimensional space we apply the Truncated SVD algorithm, a low-rank matrix approximation based on random sampling [12]. In particular, given the TF-IDF matrix of size \(|V'|\times n\) and the desired dimensionality m of the content vectors, with \(m \ll n\), the algorithm returns a matrix of size \(|V'|\times m\).
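A minimal sketch of this step with scikit-learn is shown below, assuming the HTML-level preprocessing has already produced one plain-text document per page; the frequency thresholds mirror the ones stated above.

```python
# Sketch of content-vector generation: TF-IDF with frequency cut-offs,
# followed by randomized Truncated SVD for dimensionality reduction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD


def content_vectors(texts, m=120):
    tfidf = TfidfVectorizer(max_df=0.90, min_df=0.05)  # drop terms in >90% or <5% of pages
    X = tfidf.fit_transform(texts)                     # |V'| x n sparse TF-IDF matrix
    svd = TruncatedSVD(n_components=m, algorithm="randomized")
    return svd.fit_transform(X)                        # |V'| x m dense matrix
```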

2.4 Content-Link Coupled Clustering

Once the content vector \(v_c \in \mathbb {R}^m\) and the link vector \(v_l \in \mathbb {R}^d\) of each web page in \(V'\) have been generated, the last step of the algorithm concatenates them into a new vector of dimension \(m+d\). Before concatenation we normalize each vector by its Euclidean norm; in this way we ensure that the components of \(v_l\) with the highest weights are as important as the components of \(v_c\) with the highest weights. The resulting matrix preserves both structural and textual information and can be used by traditional clustering algorithms based on the vector space model. In this study we consider K-MEANS and H-DBSCAN [5] because they are well known and exhibit complementary properties.
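A minimal sketch of the coupling step is given below, using scikit-learn for normalization and K-MEANS; H-DBSCAN can be plugged in analogously (e.g. via the hdbscan package).

```python
# Sketch of the content-link coupling: L2-normalize each representation,
# concatenate, then cluster the unified vectors.
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans


def cluster_pages(content_vecs, link_vecs, n_clusters):
    # content_vecs: |V'| x m array, link_vecs: |V'| x d array (same row order)
    features = np.hstack([normalize(content_vecs), normalize(link_vecs)])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
```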

Table 1. Description of websites

3 Experiments

In order to empirically evaluate our approach, we performed validation on four computer science department websites: Illinois (cs.illinois.edu), Princeton (cs.princeton.edu), Oxford (www.cs.ox.ac.uk), and Stanford (cs.stanford.edu). The motivation behind this choice is our competence in manually labelling pages belonging to this domain, which was necessary in order to create a ground truth for the evaluation of the clustering results. The experimental evaluation is conducted to answer the following questions: (1) What is the real contribution of combining content and hyperlink structure in a single vector space representation, with respect to using only textual content or only hyperlink structure? (2) What is the real contribution of exploiting web page structure (i.e. HTML formatting) and, specifically, what is the role of web lists in reducing noise and improving clustering results? Table 1 describes the size of each dataset. In particular, to correctly analyze the contribution of web lists to the clustering process, we compare only the web pages extracted both by crawling websites using web lists and by traditional crawling (first column of Table 1). Moreover, we report the size of the edge set obtained with traditional crawling (second column) and with crawling based on web lists (third column). Finally, the last column reports the number of clusters manually identified by the experts.

We evaluated the effectiveness of the approach using the following measures (a scikit-learn sketch for computing all of them is given after the list):

  • Homogeneity [20]: each cluster should contain only data points that are members of a single class. This measure is computed by calculating the conditional entropy of the class distribution given the proposed clustering.

  • Completeness [20]: all of the data points that are members of a given class should be elements of the same cluster. It is computed by the conditional entropy of the proposed cluster distribution given the real class.

  • V-Measure [20]: harmonic mean between homogeneity and completeness.

  • Adjusted Mutual Information (AMI): a variation of the Mutual Information \( MI = \sum _{i \in K} \sum _{j \in C} P(i,j)\, \log \frac{P(i,j)}{P(i)\,P(j)} \), where C is the set of real classes, K is the set of learned clusters, P(i, j) denotes the probability that a point belongs to both the learned cluster i and the real class j, and P(i) is the a priori probability that a point falls into cluster i. MI is generally higher for two clusterings with a larger number of clusters, regardless of whether more information is actually shared. The Adjusted Mutual Information adjusts the metric to overcome this limitation.

  • Adjusted Rand Index (ARI) [14]: a similarity measure between two clusterings obtained by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in the predicted and true clusterings. It is the chance-adjusted version of the Rand Index \( RI = {(a + b)}/{\left( {\begin{array}{c}n\\ 2\end{array}}\right) } \), where a is the number of pairs of points that are in the same class and in the same learned cluster, and b is the number of pairs of points that belong to different classes and to different learned clusters.

  • Silhouette: it measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
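All of the measures above are available in scikit-learn; a minimal sketch is shown below, assuming labels_true (expert classes) and labels_pred (learned clusters) are aligned arrays and features is the clustered matrix.

```python
# Sketch of the evaluation step with scikit-learn's clustering metrics.
from sklearn import metrics


def evaluate(labels_true, labels_pred, features):
    return {
        "homogeneity": metrics.homogeneity_score(labels_true, labels_pred),
        "completeness": metrics.completeness_score(labels_true, labels_pred),
        "v_measure": metrics.v_measure_score(labels_true, labels_pred),
        "ami": metrics.adjusted_mutual_info_score(labels_true, labels_pred),
        "ari": metrics.adjusted_rand_score(labels_true, labels_pred),
        "silhouette": metrics.silhouette_score(features, labels_pred),
    }
```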

In order to respond to our research questions we ran our algorithm with different configurations:

  • Text. We generate a vector space representation, having dimension \(m=120\), using only web pages’ textual information;

  • RW-List. We generate a vector space representation of size \(d=120\) using only hyperlink structure extracted by crawling the website using web lists. We set \(\alpha = 1\), \(rwrLength = 10\) and \(dbLength = 100\,k\);

  • RW-NoList. We generate a vector space representation of size 120 using only the hyperlink structure obtained with traditional crawling. We ran rwrGeneration with the same parameters of RW-List;

  • Comb-Lists. We combine, as defined in Sect. 2.4, the content vector of size \(m=60\) and hyperlink structure vector of size \(d=60\) generated by crawling the website using web lists.

  • Comb-NoLists. As in the Comb-Lists, but with a traditional crawler.

Since our goal is not to compare clustering algorithms, we set the K-MEANS parameter K (i.e. the total number of clusters to generate) to the number of real clusters, while for H-DBSCAN we set the minimal cluster size parameter to 5. Finally, since to the best of our knowledge there is no work which uses the skip-gram model to analyze the topological structure of websites, we ran both skip-gram versions (i.e. hierarchical softmax and SGNS) for generating link vectors. Due to space limitations, we report only the results for SGNS (setting the window size to 5), which in most cases outperformed hierarchical softmax.

Table 2. Experimental results
Table 3. Wilcoxon pairwise signed rank tests. (+) ((−)) indicates that the second (first) model wins. Results are highlighted in bold if the difference is statistically significant (at p-value = 0.05). The tests have been performed by considering the results obtained with both the hierarchical softmax and the SGNS skip-gram models.

Table 2 presents the main results. In general, the experiments show that the best results are obtained by combining textual information with hyperlink structure. This is more evident for the Illinois and Oxford websites, where content and hyperlink structure codify complementary information for clustering purposes. However, for the Stanford website, using the textual information decreases the clustering performance. The importance of combining content and hyperlink structure is confirmed by the Wilcoxon signed rank test (see Table 3). This behaviour is quite uniform across all the evaluation measures considered.

Concerning the last research question, the results do not show a statistically significant contribution of web lists to the clustering process (see Table 3). This is because the analyzed websites are very well structured and contain few noisy links. This can be observed in Table 1, where there is no appreciable difference in the number of edges between the real web graph and the one extracted using web lists. However, as expected, Completeness is higher for Comb-Lists, confirming that clusters have higher "precision" when crawling is based on web lists.

4 Conclusions and Future Works

In this paper, we have presented a new method which combines information about the content, web page structure and hyperlink structure of web pages in a single vector space representation that can be used by any traditional, well-performing clustering algorithm. Experimental results show that the content and the hyperlink structure of web pages provide different and complementary information which can improve the efficacy of clustering algorithms. Moreover, the experiments do not show statistical differences between the results obtained using web lists and those obtained ignoring web page structure. As future work, we will run our algorithm on different domains and on less structured websites in order to observe whether web lists are really unnecessary in the web page clustering process.