1 Introduction

The semi-structured, dynamic and heterogeneous nature of websites makes information classification increasingly challenging [1, 2]. As a result, even the most versatile search engines return less accurate results when very specific information is sought on the web. Current search engines return a ranked list of documents based on textual similarity, together with an independent measure of each web page’s importance [3]. The search outcome is therefore a myriad of web pages, which requires further user-defined search filters and makes the search task more difficult. In more complex cases, the user may even need to visit each web page and search manually for the desired information.

However, dealing with genres raises two important questions: (i) which genres are the most important and relevant to consider, and (ii) which factors explicitly help distinguish between genres? In general, web pages are heterogeneous in nature, i.e., despite any prescription of what a genre is, genres tend to overlap and mix [4, 5]. For instance, the Google search engine has tabs for ‘Maps’, ‘News’, ‘Images’, ‘Videos’ etc., which segregate content according to presentation style rather than topic. However, web page information is not classified according to genre, and as a result the user is unable to specify a category of content style such as ‘Report’, ‘Wikipedia’ or ‘e-Commerce’, which may match the search interest more closely. Thus, classifying web pages into genres is a challenging task, as it depends on identifying the specific features that describe a web page in the context of a genre.

Despite these difficulties, the need for an additional classification scheme, particularly in terms of genre, has been recognized. To date, web genre classification distinguishes between pages by means of their style, presentation layout, form and meta-content features rather than topic. Genre adds an extra dimension to the classification of web pages and can improve search results. In order to classify web pages, we require specific features by which the genres can be distinguished. Identifying these features is the objective of the current work.

Assuming that a set of independent HTML metrics can be identified and that a combination of these metrics can aptly describe the genre of a given website, we propose that there exists a threshold value associated with each metric, on the basis of which web pages can be classified. Provided that such a scheme exists and that the underlying methodology is simple, information retrieval from the web can be expected to be fast and accurate, and large databases simpler to handle. As a case study, we try to identify websites related to “Travel and Tourism” and “Social Media” from a sample space containing genres such as “E-Commerce”, “Social Media”, “Entertainment”, and “News”. Using statistical analysis of the data distribution, we develop a systematic methodology for identifying the thresholds of the web metrics by exploiting the inherent characteristics of the HTML metrics. The thresholds of the selected web metrics are then used to build models that predict the genre of a website using machine learning techniques.

The remainder of the paper is organized as follows. Section 2 provides a brief review of related work on web genre classification and on the estimation of thresholds in software engineering. Section 3 explains our threshold-based feature selection methodology. In Sect. 4, we show our results, which include the statistical description of the metric data distribution and the classifier performance against a range of threshold values. From the observed trend, we find that a reasonable threshold associated with each web category is the value corresponding to twice the median absolute deviation. In Sect. 5 we present the limitations of the work. Finally, we conclude the work in Sect. 6 and suggest improvements for future studies.

2 Literature Survey

Web genre classification depends upon the input metrics, the machine learning technique and the genre class. In earlier research, a web page was assumed to represent a single web genre [6, 7]. In contrast, many researchers have argued that a single-genre classification scheme is inappropriate for web pages [8,9,10,11].

In one of the earliest attempts, Crowston and Williams [12], working from a set of randomly selected web pages with a certain set of objectives identified, proposed four types of genres: Reproduced genres, Adapted genres, Novel genres and Unclassified web pages. The study revealed that genres cannot simply be cached and stored in a repository, but evolve. A similar view was supported by Shepherd and Watters [6, 7], who introduced the term “Cybergenre”, now better known as “web genre”. Accordingly, a genre is characterized by the base triplet {<content>, <form>, <functionality>}. While <content> and <form> represent the traditional genres, <functionality> defines the capabilities offered by the web. Along with the functionality attribute, another important attribute was soon recognized, namely the use of hypertext and/or HTML, with each hypertext corresponding to a genre. It may be noted that although a website is a collection of web pages, genre analysis is generally done for the entire website [13, 14]. A super-genre classification of websites [15] used structure, content and their combination to improve the classification accuracy.

Overall, the research work discussed above is in agreement that structure and functionality attributes of a web page represent useful information which can be used to identify the genre of a website. Therefore, we have focused on the quantitative web metric set of <Structural> and <Functionality> attributes represented by text formatting, navigation and external object HTML tags.

In general, thresholds in software systems can be estimated from statistical deductions and mathematical models. The statistical methodology provides qualitative thresholds; however, to improve the validity of the results it is important to study the relation between the data characteristics, the underlying assumptions and the nature of the problem. Here we briefly review threshold estimation studies for software that are based on statistical inference.

The study conducted by Erni and Lewerentz [16] estimated the threshold from the statistical mean (µ) and standard deviation (σ), represented as T± = µ ± σ, assuming the input metric data to be normally distributed. Such distributions are seldom found in software projects, and hence the applicability of the technique is limited. The work by French [17] combined Chebyshev’s inequality with µ and σ for threshold calculation, but the nature of the data distribution was again not considered and the methodology was sensitive to outliers. More recently, de Siqueria et al. [18] suggested three similarity thresholds, using the arithmetic or weighted mean, k-means clustering and silhouette coefficient maximization, for genre-aware focused crawling.
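As a minimal illustration of the T± = µ ± σ rule described above, the following Python sketch computes both limits for a single metric. The function name, the use of the population standard deviation and the sample values are assumptions made for illustration only; they are not taken from [16].

```python
import statistics

def mean_sigma_thresholds(values):
    """Return (T-, T+) = (mu - sigma, mu + sigma) for one metric."""
    mu = statistics.mean(values)
    # Assumption: population standard deviation; the sample version
    # (statistics.stdev) would give slightly wider limits.
    sigma = statistics.pstdev(values)
    return mu - sigma, mu + sigma

# Made-up metric values, used only to exercise the function.
counts = [2, 4, 4, 4, 5, 5, 7, 9]
low, high = mean_sigma_thresholds(counts)
print(low, high)
```

Because both µ and σ are pulled towards extreme values, a single very large count would shift both limits, which is the sensitivity to outliers noted above.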

Shatnawi [19] used ROC characteristics to identify threshold values and analyzed their association with different error severity levels. Relevant threshold values were found for the high and medium risk categories of ordinal classification, but practical threshold values could not be found for binary classification. In a following study [20], the author calculated the thresholds corresponding to the C&K metrics using Bender’s approach [21], based on logistic regression, and found that risk levels can be used to identify metric thresholds. Similarly, Malhotra et al. [22] used Bender’s approach to calculate metric thresholds and determined the effects of thresholds on change prediction in inter-project studies. Their results showed that the transferability of the thresholds in inter-project comparisons is limited to a narrow confidence interval. In a more recent work, Shatnawi [23] proposed a data transformation method to reduce skewness in the data, and the threshold values were estimated using statistical parameters such as μ and σ, similar to the work of Erni and Lewerentz [16]. However, the underlying methodology is limited by the shift of values by a constant prior to the data transformation. Alves et al. [24] investigated the data distribution properties of object-oriented metrics to derive threshold values, and the estimated metric thresholds were insensitive to data outliers. Similarly, Ferreira et al. [25] statistically analyzed the data to calculate the threshold range of certain metrics. The authors found that most of the metrics followed a heavy-tailed distribution and argued that a general threshold could not be applied to object-oriented software projects. On the other hand, Hussain et al. [26] compared the effect of thresholds derived using Bender’s approach with those reported by Alves et al. [24] and concluded that thresholds cannot be generalized across systems due to variation in data characteristics.

The studies discussed above have emphasized the importance of the data characteristics and statistics to be considered before estimating the thresholds. Hence, in this work we estimate the threshold of web metrics using the statistical measures of central tendency after analyzing the data distribution. The threshold estimates are used for categorizing the websites according to their genre.

3 Methodology

The methodology followed in this work is shown schematically in Fig. 1. The Web Metric Collection and Reporting System (MCRS) [27] crawls a URL to collect HTML, NLP and text metrics for web genre classification. The HTML metric collector extracts all the links in the web page and collects various HTML web metrics, namely text formatting tags, document structure tags, external object tags, and instruction and navigation tags. As highlighted in [28], a combination of lexical, functional and structural attributes should be used for genre classification. We have therefore used the “Structural” features of a web page, represented by the text formatting tags <br>, <div>, <li>, <p>, <span>, <ul> and the navigation tag <a>, while the external object tags <img> and <script> are used to define “Functionality”. These nine web metrics, listed in Table 1, constitute the independent variables in the study and are used to categorize a website as “Travel and Tourism”, “E-Commerce”, “Social Media”, “Entertainment” or “News”.

Fig. 1 Schematic representation of the methodology adopted in the study

Table 1 The HTML metrics used in the study

The set of nine metrics and the website category serve as input to the statistical analysis and metric distribution module. This module calculates the statistical parameters of the sample space, including the range (Rmax, Rmin), mean (µ) and median (xm). The histogram plots are also examined to identify the distribution characteristics of the input metric space. These statistical parameters, along with the sample space, serve as input to the Threshold Metric Module (TMM), which estimates the threshold values for the “Travel and Tourism” and “Social Media” website categories.

The website category prediction model is built using the Naive Bayes classifier with input from the TMM, and reports the classification performance for the web category in terms of AUC values. Naive Bayes was selected not only because of its common use in data mining applications, but also because of its reliable performance on small datasets [29]. The default parameter settings specified in Weka were used for the learner. AUC was chosen a priori as the performance measure because of the inherent class imbalance observed in the dataset. By definition, AUC is the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. AUC is the area under the Receiver Operating Characteristic (ROC) curve, and its magnitude varies between 0 and 1. Note that ROC analysis helps in decision making by relating the performance and non-performance of a classification model.
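The probabilistic definition of AUC given above can be made concrete with a short Python sketch that counts correctly ordered (positive, negative) pairs. This is an illustrative reimplementation, not the Weka routine used in the study; the function name, the half-credit convention for ties and the scores below are assumptions.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Assumption: a tie between a positive and a negative score counts
    as half a correctly ranked pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up classifier scores; label 1 marks the target web category.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
auc = pairwise_auc(scores, labels)
print(round(auc, 3))
```

Since the measure depends only on the relative ordering of positive and negative instances, it does not change with the proportion of the two classes, which is one reason it is often preferred for imbalanced data.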

4 Results and Discussion

4.1 Data Characteristics

Table 2 describes the statistics of the metric data, which include all five web categories. The range of each metric, with its upper limit designated as Rmax, is shown along with the measures of central tendency, i.e., the mean (µ) and median (xm).

Table 2 The statistical description of the selected HTML metrics for the sample space, travel and tourism and social media categories

It is evident from Table 2 that the ranges of the metrics are quite different. We also find a significant difference between the mean and median values, which implies a non-normal distribution of the metrics. All metric distributions are found to be positively skewed, since µ > xm. Empirically, taking the difference (µ − xm) as a measure of skewness, the data reveal that the distributions of the <div> and <a> metrics are relatively more skewed, while that of <script> is the least skewed.
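The descriptive statistics discussed here (range limits, µ, xm and the gap µ − xm used as an empirical skewness indicator) can be sketched in Python as follows. The function name, the dictionary keys and the sample counts are hypothetical and are not values from Table 2.

```python
import statistics

def describe_metric(values):
    """Summarize one metric: range limits, mean, median and mean-median gap."""
    mu = statistics.mean(values)
    xm = statistics.median(values)
    return {
        "r_min": min(values),
        "r_max": max(values),
        "mean": mu,
        "median": xm,
        "skew_gap": mu - xm,  # a positive gap suggests positive (right) skew
    }

# Made-up <div> counts for a handful of pages (not data from the study).
div_counts = [3, 5, 8, 10, 12, 15, 120]
summary = describe_metric(div_counts)
print(summary)
```

In this toy sample a single large count pulls the mean well above the median, which is the pattern the text interprets as positive skew.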

We first attempt to construct the threshold parameters for the “Travel and Tourism” web category. To this end, we first analyze the statistical parameters of the web category with respect to the sample space. Table 2 gives the statistical description of the metrics associated with the “Travel and Tourism” web category. The overall characteristics of the sub-space remain similar to those of the sample space, i.e., the metric distributions are skewed, with the mean being greater than the median. We also note that the category resides well inside the sample space, with the range maximum of none of its nine metrics coinciding with that of the sample itself.

The basic statistical description of the data pertaining to the “Social Media” web category is also shown in Table 2. It is found that the distribution of the data is very different from that of the <script> metric and spans the entire range in the “Social Media” category, which is not the case for “Travel and Tourism”. The <div> metric ranges over [0, 472] in “Travel and Tourism”, but shows a wider range of [0, 1229] in the “Social Media” category. We also note that four of the nine metrics, namely <br>, <div>, <p> and <img>, are spread across the entire range in the “Social Media” category, in contrast to the data distribution of the “Travel and Tourism” category.

For a better understanding of the category-wise metric distribution with respect to the sample, and also among the five categories, we analyzed the data in terms of frequency plots, as shown in Figs. 2 and 3. The wide difference among the web categories in the metric space is evident. Not only are the frequencies associated with the metric values very different, but certain metric distributions are continuous for some categories, while for others they appear non-uniform and discontinuous. For instance, for <div> in “Travel and Tourism” the frequency of data in the range [0, 50] is 17, while in “Social Media” it is 30. On the other hand, the metrics <li> and <ul> both show a continuous, decreasing trend with increasing range in the “Social Media” category, while in “Travel and Tourism” the distribution is discontinuous. In fact, based on our inter-quartile analysis, we find that the data in the “Travel and Tourism” category for the metric <li> in the range [300, 350] and for <ul> in [50, 60] are outliers. Thus, the statistical analyses show that the web categories in the sample space differ widely in terms of the metrics that define each category.
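The text does not state which inter-quartile rule was applied. One common choice, assumed here, is Tukey's fences, which flag values lying more than 1.5 × IQR outside the quartiles. The following Python sketch illustrates that rule on made-up <li> counts; the function name, the multiplier k and the quantile method are all assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    # Assumption: the "inclusive" quantile method; other methods shift
    # the quartiles slightly for small samples.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up <li> counts with one isolated large value (not data from the study).
li_counts = [10, 12, 15, 18, 20, 22, 25, 30, 320]
flagged = iqr_outliers(li_counts)
print(flagged)
```

An isolated bin far to the right of an otherwise compact histogram, as described for <li> and <ul> in “Travel and Tourism”, is the kind of pattern such a rule would flag.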

Fig. 2 Histogram representation of the sample space (shaded black), with data on “Travel and Tourism” (shaded grey) projected, of the nine metrics used in the study

Fig. 3 Histogram representation of the sample space (shaded black), with data on “Social Media” (shaded grey) projected, of the nine metrics used in the study

These statistical observations of the metric data motivate the search for a threshold value, or a set of values, which differentiates each web category in a given sample space.

4.2 Threshold Calculation

The problem at hand, therefore, is to determine the threshold by which one can classify the web categories from a given sample space. For this, one needs an initial guess for the threshold value, from which the performance measure of a chosen category can be calculated. Further assuming that there exists a unique set of parameters that determine the threshold, we vary the guess in increments so as to obtain an optimal value of performance. As noted above, the range is quite different for all metrics within a given web category, and also among the various categories. The minimum and maximum values of a metric distribution are therefore not good statistical parameters to use, since they can fluctuate greatly from sample to sample. Besides, the distributions, as mentioned above, deviate significantly from a normal distribution.

As a result, we argue that neither the limiting range parameters nor the mean value of the distribution is a good choice of initial guess for the threshold. As a simple, nonparametric statistic suited to skewed data, we therefore take the median (xm) as our reference measure of central tendency, which also serves as our initial guess for the threshold value. In analogy to the role of the standard deviation (σ) in a normal distribution, a widely used measure of variability in skewed datasets is the statistical quantity referred to as the “Median Absolute Deviation” (MAD). Much like the relevance of μ ± 2σ for normally distributed data, here we use xm ± (2 × MAD) as the range for the calculation of the threshold value. Mathematically, MAD is defined in Eq. (1) as,

$$ {\text{MAD}} = {\text{median}}_{i} (|X_{i} - {\text{median}}_{j} (X_{j} )|) $$
(1)

By definition, MAD represents a measure of statistical dispersion. For non-normal datasets, MAD is a more robust estimator of scale than the conventional variance or standard deviation. MAD is also a much better statistical quantity for distributions that have neither mean nor variance, such as the Cauchy distribution, and thus serves as a universal statistical quantity for any metric space irrespective of its nature. A further advantage of using MAD as a statistical estimator is its insensitivity to outliers. We define the threshold as a boundary that differentiates radically different regions. In this context, we anticipate a change in the variation of AUC as a function of xm + nδ, where δ is an increment and n a positive integer. The maximum range over which the values are varied is taken to be 2 × MAD.
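A minimal Python sketch of the MAD, taken as the median of the absolute deviations from the median, and of the candidate thresholds xm + nδ is given below. The step size δ = 2 × MAD / 5 is an assumption, chosen so that the fifth step lands on xm + 2 × MAD, in line with the later observation that xm + 5δ coincides with the 2 × MAD value. The function names and sample values are hypothetical.

```python
import statistics

def median_absolute_deviation(values):
    """MAD: the median of the absolute deviations from the median."""
    xm = statistics.median(values)
    return statistics.median(abs(x - xm) for x in values)

def candidate_thresholds(values, steps=5):
    """Return xm + n*delta for n = 0..steps, ending at xm + 2*MAD.

    Assumption: delta = 2*MAD/steps; the step size is not stated
    explicitly in the text.
    """
    xm = statistics.median(values)
    delta = 2 * median_absolute_deviation(values) / steps
    return [xm + n * delta for n in range(steps + 1)]

# Made-up <p> counts (not data from the study).
p_counts = [1, 2, 4, 6, 9, 12, 40]
mad = median_absolute_deviation(p_counts)
candidates = candidate_thresholds(p_counts)
print(mad, candidates)
```

The large value 40 in this toy sample leaves both the median and the MAD unchanged compared with a moderate value in its place, which illustrates the insensitivity to outliers mentioned above.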

4.3 Website Category Prediction Model Using Threshold

The median and 2 × MAD values of all metrics are first calculated (refer to Table 2). To calculate the performance of the web category prediction model using thresholds, we transform each metric into binary form: metric values below the threshold are transformed to “zero” and those above to “one”. Thereafter, the binary-transformed dataset is fed as input to the Naive Bayes algorithm, with stratified remove folds as the class balancing technique. The corresponding AUC values are computed and shown in Table 3.
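The binary transformation can be sketched in Python as below. The treatment of a value exactly equal to the threshold is not specified in the text; mapping it to zero is an assumption, as are the function name, the dictionary-based row layout and the numbers used.

```python
def binarize(rows, thresholds):
    """Map each metric value to 1 if it is above its threshold, else 0.

    Assumption: a value equal to the threshold is mapped to 0.
    """
    return [
        {tag: int(row[tag] > thresholds[tag]) for tag in thresholds}
        for row in rows
    ]

# Made-up tag counts for two pages and made-up per-metric thresholds.
pages = [{"div": 120, "a": 40}, {"div": 10, "a": 95}]
thresholds = {"div": 50, "a": 60}
binary_pages = binarize(pages, thresholds)
print(binary_pages)
```

The resulting 0/1 features would then be passed to the classifier in place of the raw counts.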

Table 3 The performance of the web category prediction model with AUC measure, with and without threshold

The AUC with the median as threshold for the transformed data is 0.66, which is lower than the AUC computed for the original (untransformed) dataset, i.e., 0.71. Note that the value of xm + nδ at which the AUC shows a characteristic change is referred to as the threshold value associated with the web category.

In Fig. 4 we show the stacked line graph of the variation of the performance measure with respect to xm + nδ, obtained for both “Travel and Tourism” and “Social Media”. The results are shown in this form mainly because the stacked line representation enables us to capture the trend in the variation over the threshold range assumed in the calculation. Besides, since such a graph is cumulative at each point, the data do not overlap. It is evident that the performance measures of both web categories behave very similarly. As the threshold is increased in steps of δ, the graph initially decreases by 16% and 7% for “Travel and Tourism” and “Social Media”, respectively. Thereafter, we find a gradual increase in the performance measure up to xm + 4δ, beyond which the values saturate. Recalling the definition of the threshold as the boundary that separates two regions in the variation of performance, we find that the boundary in this study points to xm + 5δ. Interestingly, this also corresponds to the (2 × MAD) value. Thus, the observation that two very different skewed metric data distributions, associated with “Travel and Tourism” and “Social Media”, not only exhibit a similar trend but also point to (2 × MAD) as the threshold leads us to formulate the following hypothesis: there exists a close relationship between the threshold value and the statistics in the web category distinction, with the (2 × MAD) value corresponding to the threshold itself.
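In the study the boundary is read from the trend in Fig. 4. As a hypothetical way of locating such a saturation point programmatically, the Python sketch below returns the first step after which successive AUC values change by no more than a tolerance. The function name, the tolerance and the AUC values are assumptions for illustration; the values are not results from Table 3 or Fig. 4.

```python
def saturation_step(auc_values, tol=0.02):
    """Return the smallest n such that all later step-to-step changes
    in AUC are within tol (assumed saturation criterion)."""
    if not auc_values:
        raise ValueError("auc_values must not be empty")
    for n in range(len(auc_values)):
        tail = auc_values[n:]
        if all(abs(b - a) <= tol for a, b in zip(tail, tail[1:])):
            return n
    return len(auc_values) - 1  # not reached: the last index always qualifies

# Made-up AUC values for thresholds xm + n*delta, n = 0..6:
# an initial dip, a gradual rise, then a plateau.
toy_auc = [0.60, 0.52, 0.55, 0.59, 0.63, 0.64, 0.64]
step = saturation_step(toy_auc)
print(step)
```

The result depends on the chosen tolerance, so such a criterion would need to be fixed before comparing web categories.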

Fig. 4 The stack plot showing the variation of the relative performance measure of “Travel and Tourism” and “Social Media”. Note that for the exact AUC measure corresponding to “Social Media”, the value corresponding to a given xm + nδ has to be subtracted from the “Travel and Tourism” value

5 Threats to Validity

One basic question is how well the present experiment has been carried out. From this perspective, one confounding issue is the selection of the input metrics. Whether or not the chosen metrics form a complete metric space, and whether there exists a linear dependence between the metrics, is an internal threat to the validity of the results. As a check, it would suffice to apply feature selection techniques and recalculate the optimal threshold value.

In this study, we have used the Naive Bayes algorithm. The use of a single machine learning algorithm could be a possible threat to the conclusion validity of this study. However, as mentioned above, Naive Bayes has been found to yield reliable results for smaller datasets and, despite being a simple model, the algorithm has found numerous applications, providing high performance for a large variety of datasets. As future work, we will evaluate the performance with several other machine learners, such as Bagging and Boosting algorithms.

The observations from this study are limited in how far they generalize to similar studies that span other networking sites. This poses a possible threat to external validity. For instance, the design of websites can depend significantly on the culture and tradition of the various communities across the globe. To minimize these local effects, it is important to collect data from other networking sites across the globe and investigate to what extent the results can be generalized.

6 Conclusion

Identification of proper web genres is expected to ease classification at both the organizational and the user level. Given that the evaluation of websites would thereby become feasible at lower cost, the development of web genres is becoming increasingly important for developers seeking to ease search queries. In this regard, we propose a threshold-based model to distinguish various web categories using statistical measures of central tendency. Since the metrics that define the metric space are found to be skewed, we conjecture that the threshold is more closely related to the median than to the more widely used mean. Defining the threshold as the boundary that differentiates the classification performance rendered by a machine learning algorithm, we vary the threshold value in increments, from the median towards the skewed part of the distribution. The trend captured by the AUC values clearly shows that beyond a certain optimum value, the magnitude of the performance measure saturates. We argue that the set of metric values at which the performance reaches saturation can be termed the threshold. In statistical terms, our study shows that the threshold is (xm + 2 × MAD), where xm and MAD represent the median and the “median absolute deviation”, respectively. In analogy with the standard deviation, which is commonly used for datasets with a normal distribution, we conclude that the proposed threshold estimate lies within the 95% confidence interval. To our knowledge, the use of the median absolute deviation has not been proposed in earlier works on threshold determination in website categorization, and it hence requires more experiments over a wider range of website classifications.