1 Introduction

Nowadays, searching the web is becoming increasingly challenging due to the unprecedented amount of data and the short length of queries. Thus, understanding users' real intent behind a query plays a key role in improving the quality of web search. Many commercial search engines provide query suggestion to predict users' intent based on previously issued queries. However, the previous query is not the only possible material for query prediction; the previously viewed web page might also inspire a user to issue a related query. Little research has paid attention to this.

A previously viewed webpage may make a user think of something, but the topics users might think of are very diverse. For example, suppose a user has just viewed the homepage of SIGIR 2011 (see Fig. 1) [1]. After viewing the page, the user might want to know more about the conference. Moreover, he or she might want to know more about the city where the conference was held, such as the weather, the scenic spots, or the history of the city. Another possibility is that the user seeks information about the building in the image. In addition, he or she might think of a SIGIR paper read before, or of other conferences such as WWW or CIKM. There are many possibilities.

Fig. 1. Screenshot of the SIGIR 2011 web site (www.sigir2011.org)

However, not every notion leads to a query. We wonder whether a viewed web page can trigger a user to issue a query. Unfortunately, only the original user who issued the query knows whether it was triggered by the web page. Therefore we reasonably assume that a query is triggered if a user issues a related query right after viewing a webpage. Under this assumption, we observed that this phenomenon really exists in the log. For instance, a user viewed the Wikipedia page of Andy Carroll, an English footballer, and issued the query "Andy Carroll" right after viewing the page. In another example, a user viewed the Wikipedia page of American Craftsman, a style of domestic architectural design, and then issued the query "mayo furniture", a furniture company in America.

Since the topics of notions arising from a webpage are diverse, the topics of triggered queries are diverse, too. Even an image or a short sentence might prompt a user to issue a corresponding query. Therefore, it is extremely hard to enumerate all the queries that might be triggered. In our work, we train an SVM classifier to predict whether a specific web page will trigger a specific query. We then show how the classifier could be applied to suggest queries while a user is browsing a web page.

2 Related Work

Much work has been done on understanding users' intent behind a query [2–4]. White et al. [2] studied website recommendation given the currently browsed web page and five types of contextual information: social, historic, task, collection, and user interaction. More specifically, social combines the interests of other users who have browsed the currently browsed webpage; historic covers the interests of the current user; task contains the web pages that share the same queries with the currently browsed webpage in search engines; collection contains the web pages connected to the currently browsed webpage by hyperlinks; and user interaction contains the current user's recent interactions before browsing the current webpage. Shen et al. [3] inferred users' intent from sequential search data using their model, Sparse Hidden-Dynamic Conditional Random Fields. Given a session from a server-side search log, composed of several search and click events, the model predicted which label the user's intent in that session belongs to; eight different labels were defined and annotated in that work. Cheng et al. [4] predicted the real intent behind users' queries and used the prediction to rank queries following a given webpage, so as to recommend queries from the webpage. In their analysis, 23.8 % of the observed patterns were "browsing → search" patterns.

Contextual information has been widely used in the information retrieval field. As mentioned, five kinds of contextual information were adopted in [2]. Cao et al. [5] took the immediately preceding queries into account as context. Their system has two stages: at the offline stage, queries in the log were summarized into concepts and a concept sequence suffix tree was built by aggregating the sessions in the log; at the online stage, the concept sequence suffix tree was used to suggest queries. He et al. [6] proposed a query recommendation method which predicts the next query given a sequence of preceding queries.

For recommendation, it is also possible to retrieve important terms from the webpage. Therefore some works [7–9] on summarization and keyword identification are reviewed in the following. Sun et al. adopted both plain text and click-through data to improve summarization [7]. Grineva et al. [8] proposed a novel way to extract key terms by modeling a document (webpage) as a graph, in which nodes represent terms and edges represent semantic relations between terms. Since the authors observed that important terms tend to cluster densely with other terms, a graph community detection algorithm was adopted to partition the graph into sub-groups, and a criterion function was employed to select the group that contains the key terms.

3 Problem and Methods

In this section, we describe each step in detail. Figure 2 shows the overall structure of our work. First, the Trend Micro log is cleaned by removing events whose URL directs users to a destination that is not a web page. We then retrieve P → Q patterns, where P denotes a webpage and Q denotes a query. These P → Q patterns are labeled and cleaned by human judges to further remove invalid webpages and queries and to classify the patterns. The cleaned and labeled P → Q patterns are then used to train the predictor with our features.

Fig. 2. Workflow

3.1 Problem Specification

Given a pair of a web page U and a query Q, representing that a user viewed the web page and then issued the query, identify whether the query Q was triggered by the webpage U. This is a binary classification problem which classifies a U → Q pair into the triggered type or the non-triggered type (see Table 2).

3.2 Data Set

The client-side log we used to retrieve users' intent was collected by Trend Micro. User actions observed at the client side are sent to and recorded on the Trend Micro server. In this work, only the records of web browsing and searching histories were extracted. The log is originally recorded for security purposes: for each URL-viewing event, the URL is sent to the server by the anti-virus software to check whether it is safe. For a URL which has been checked and deemed safe recently, the anti-virus software may not send it to the server again, so that URL would not be recorded in the log.

Nevertheless, the Trend Micro log records most actions on client computers with Trend Micro anti-virus software installed. A one-hour log was collected from 0:00 A.M. to 1:00 A.M., October 8, 2010. The languages used in this log are mainly English, Chinese, and Japanese.

Each entry in the log contains a unique user ID, a timestamp, and the URL of a viewed web page. Entries whose URL directs users to a Google search engine result page (SERP) are defined as query entries, denoted as Q; all other entries are page-viewing entries, denoted as U. With this definition, a search session may look like

$$ U \to U \to Q \to U \to Q \to Q \to U \to U \ldots $$

Note that a session ends if the user idles for more than thirty minutes. Moreover, we only collected sessions with at least one query entry.

All sessions can be categorized into either query sessions or URL sessions. A query session is a session that starts with a query entry, and a URL session is a session that starts with a page-viewing entry.
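
To make the segmentation concrete, the following is a minimal sketch of how such sessions could be reconstructed from one user's time-ordered log entries. The timestamp unit (seconds) and the SERP test are simplified assumptions for this example, not the exact Trend Micro log schema or Google URL pattern:

```python
from urllib.parse import urlparse

IDLE_CUTOFF = 30 * 60  # a session ends after 30 minutes of inactivity (seconds)

def is_query_entry(url):
    """Simplified test for a Google search engine result page (SERP)."""
    parsed = urlparse(url)
    return "google." in parsed.netloc and parsed.path == "/search"

def split_sessions(entries):
    """Split one user's time-ordered (timestamp, url) entries into sessions."""
    sessions, current, last_time = [], [], None
    for timestamp, url in entries:
        if last_time is not None and timestamp - last_time > IDLE_CUTOFF:
            sessions.append(current)
            current = []
        current.append((timestamp, url))
        last_time = timestamp
    if current:
        sessions.append(current)
    # keep only sessions containing at least one query entry
    return [s for s in sessions if any(is_query_entry(u) for _, u in s)]

def is_url_session(session):
    """A URL session starts with a page-viewing entry, a query session with Q."""
    return not is_query_entry(session[0][1])
```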

Table 1 shows the statistics of the one-hour log we use. In the table, the average number of terms in a query is 1.08298. Here the terms in a query are those split by the original user with spaces or plus signs; no further word segmentation is applied.

Table 1. Statistics of our dataset

As stated in Sect. 3.1, we focus on whether a query is triggered by its previously viewed webpage. Thus, only URL sessions are used to train the predictor. The major reason is that URL sessions better capture users' behavior while surfing the Internet. Compared with URL sessions, the behavior revealed in query sessions is more likely an attempt to satisfy a pre-existing information need.

3.3 Cleaning

Since the log recorded by Trend Micro contains almost every URL that users accessed, many URLs in the log do not point to a real web page, including document files (".doc", ".txt"), image files (".jpg", ".gif"), executable files, advertisements, frames, JavaScript files, etc. The major goal of cleaning is to remove URLs that are not linked to a web page.

There are two stages of cleaning in our work: cleaning illegal URLs and cleaning illegal web pages/queries. In the stage of cleaning illegal URLs, we removed entries whose URL cannot point to a web page. More precisely, we removed any URL ending with a filename extension other than ".html" or ".htm", as sketched below.
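
A minimal sketch of this filter, assuming the rule applies to the path component of the URL (URLs without any extension are kept, since they may still point to a web page):

```python
import os
from urllib.parse import urlparse

def is_legal_url(url):
    """Keep URLs whose path has no extension, or ends in .html/.htm."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return ext in ("", ".html", ".htm")

# e.g. is_legal_url("http://example.com/index.html") -> True
#      is_legal_url("http://example.com/photo.jpg")  -> False
```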

However, this rule is not effective enough to remove all entries whose URL is not a webpage. Therefore, we adopted a further cleaning step performed by human judges, called cleaning illegal web pages/queries. In this stage, we combined labeling and advanced cleaning. First, a query and its preceding web pages were retrieved until another query or the head of the session was reached; if there were more than five preceding web pages, only the last five were kept. This yields a pair composed of at most five URLs and a query. The judges were asked to select the latest valid URL. If there was no valid URL, the pair was labeled as broken. In addition, a pair was labeled as broken if its query contained nothing or could not be decoded into a readable character sequence.

3.4 Labeling

In this subsection we describe our labeling system. At the top, the ID and the result of the current pair are shown. Below, the five URLs, the query, and the choice of relation are displayed. The judges are requested to select the latest valid URL and then identify the relationship between the selected URL and the query. Each pair is classified by one judge; in this work, there are 5 judges aged between 22 and 26. Judges can easily view the content of a URL by clicking the link. In addition, since the query of a pair is retrieved from the URL of a Google search engine result page and our dataset contains non-English queries, some queries are percent-encoded per RFC 1738 [10]; we therefore also show the decoded query in the system to ease judging.
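
As an illustration, the decoding step might look like the sketch below; the q parameter name and the strict-decoding behavior are assumptions for this example, not a description of the exact labeling system:

```python
from urllib.parse import urlparse, parse_qs

def extract_query(serp_url, encoding="utf-8"):
    """Extract and decode the query from a Google SERP URL; None if broken."""
    try:
        params = parse_qs(urlparse(serp_url).query,
                          encoding=encoding, errors="strict")
    except UnicodeDecodeError:
        return None  # cannot be decoded into a readable character sequence
    values = params.get("q")
    return values[0] if values else None

# e.g. extract_query("http://www.google.com/search?q=andy+carroll")
#      -> "andy carroll"
```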

For an unlabeled pair, both the selected URL and the relationship read "have not been labeled". After labeling, the selected URL must be one of URL1 through URL5. For a broken pair, the selected URL is labeled as one of URL1–URL5, and the relationship remains "have not been labeled". Unlabeled and broken pairs are excluded from the training and testing of our predictor. In our dataset, there are 1907 pairs from URL sessions. We labeled 1416 pairs from URL sessions, 325 of which are broken.

Pairs that are neither unlabeled nor broken are classified into three categories: triggered, non-triggered, and unknown. The labeling categories and instructions are shown in Table 2. Moreover, a triggered pair is further categorized as main topic or non-main topic; we discuss the difference between them in Sect. 5.2.

Table 2. Labeling categories of a valid pair

Most unknown pairs arise because the webpage the user read requires logging in to access more information. Another common reason is that the query cannot be understood by the judges; some queries contain just a single letter of the English alphabet or a single number.

3.5 Basic Feature Engineering Method

After labeling, each labeled pair contains a selected webpage U, a query Q, and the relation between them. In this section, 18 features are introduced and explained. In addition, since several features are frequency features, we propose and compare three different ways to compute the "frequency" when a query contains multiple terms (see "Handling Multiple Query Terms" below). These features are then used to train and test a classifier with LIBSVM [11].

Features. First, the features used are listed in the following:

  • F1: Rank (in top 100 results: 0–100; otherwise: 101)

  • F2: Top 10? (yes: 1, no: 0)

  • F3: Top 20? (yes: 1, no: 0)

  • F4: Top 30? (yes: 1, no: 0)

  • F5: Top 50? (yes: 1, no: 0)

  • F6: Top 100? (yes: 1, no: 0)

  • F7: Percentage of noun terms in query

  • F8: Log of web page length in bytes

  • F9: Query frequency in the URL of web page

  • F10: Query frequency in the webpage

  • F11: Query frequency in the title of the webpage

  • F12: Query frequency in the body of the webpage

  • F13: Query frequency in the first paragraph of the webpage

  • F14: Query frequency in the last paragraph of the webpage

  • F15: Query frequency in the first sentence of any paragraph of the webpage

  • F16: Query frequency in the last sentence of any paragraph of the webpage

  • F17: Percentage of the query terms that exist in the title of any Wikipedia article

  • F18: Cosine similarity between the web page and the query

Among these features, F1–F6 are ranking features, designed to capture the similarity and the strength of the relation between U and Q. The value of F1 is the rank of U if U appears among the top 100 results when the user issues Q, and 101 otherwise. F2–F6 indicate whether U appears among the top 10/20/30/50/100 results when Q is issued. F7 reports the percentage of noun terms in Q. F8 is the base-2 logarithm of the length of U, measured in bytes.

F9–F16 are frequency features. F9 reports the frequency of Q in the URL of U. F10 is the frequency of Q in the whole of U, including the title. F11 is the query frequency in the title of the webpage. F12 is the query frequency in the body of the webpage, where the HTML tag <body> marks the boundary of the body part. F13–F16 are the frequencies of Q in the first paragraph / last paragraph / first sentence of any paragraph / last sentence of any paragraph of U. These four features are proposed because descriptions appearing in these locations might be more important than those elsewhere. The HTML tag <p> is used to separate paragraphs, and the period (.) to separate sentences. F17 reports the percentage of terms in Q that appear in the title of any Wikipedia article; it is included because terms covered by Wikipedia might be harder to understand or more interesting, making users more inclined to query them. F18 is the cosine similarity between the term frequency (TF) vectors of Q and U. Since most queries are short and sparse, the Google search result page of the top 100 results is crawled to represent the query; thus F18 actually computes the cosine similarity between U and the snippets of Q.
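
As a sketch of F18, the cosine similarity over raw TF vectors could be computed as follows. Tokenization here is naive whitespace splitting, which is an assumption; for the query side, the input text would be the crawled SERP snippets rather than the query string itself:

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector over lowercase whitespace tokens."""
    return Counter(text.lower().split())

def cosine_similarity(text_a, text_b):
    a, b = tf_vector(text_a), tf_vector(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```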

Handling Multiple Query Terms. Usually, a query contains several terms, and it is hard to match the whole query in a document. Three ways of computing a per-query frequency are presented to address this problem.

Consider a query Q = {t1, t2, …, tn} and a document D:

  • Maximum

$$ \text{maximum} = \max_{t \in Q} \text{frequency}_{t}(D) $$
  • Average frequency

$$ \text{average frequency} = \frac{\sum_{t \in Q} \text{frequency}_{t}(D)}{n} $$
  • Average appearing probability

$$ \text{average appearing probability} = \frac{\sum_{t \in Q} I_{t}(D)}{n} $$

where

$$ I_{t}(D) = \begin{cases} 1, & \text{frequency}_{t}(D) > 0 \\ 0, & \text{frequency}_{t}(D) = 0 \end{cases} $$

These three methods of computing the frequency were evaluated on the dataset. Their accuracies are almost identical, with the maximum method performing slightly better than the others. Therefore, the maximum method is adopted in the following experiments.
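
A minimal sketch of the three schemes, assuming frequency_t(D) is a whole-token count over a pre-tokenized document:

```python
from collections import Counter

def maximum(query_terms, doc_tokens):
    tf = Counter(doc_tokens)
    return max(tf[t] for t in query_terms)

def average_frequency(query_terms, doc_tokens):
    tf = Counter(doc_tokens)
    return sum(tf[t] for t in query_terms) / len(query_terms)

def average_appearing_probability(query_terms, doc_tokens):
    tf = Counter(doc_tokens)  # I_t(D) = 1 iff frequency_t(D) > 0
    return sum(1 for t in query_terms if tf[t] > 0) / len(query_terms)
```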

3.6 Context-Aware Method

In this method, previous queries are considered. For the query of each pair, the last preceding query within the same session, called Q', is detected. Pairs without a Q' (i.e., where the query is the first query entry in the session) are discarded. Subsequently, one new feature is added:

  • F19: cosine similarity (Q, Q’)

To deal with the sparseness of queries, the Google search result page of the top 100 results is crawled, as in Sect. 3.5, and the term frequency vector of that page is adopted as the vector of the query. A small sketch of locating Q' within a session is given below.
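
The sketch reuses the simplified SERP test from the session-segmentation sketch in Sect. 3.2; the entry format is the same assumption as there:

```python
from urllib.parse import urlparse

def is_query_entry(url):
    """Simplified SERP test, as in the session-segmentation sketch."""
    parsed = urlparse(url)
    return "google." in parsed.netloc and parsed.path == "/search"

def previous_query(session_entries, query_index):
    """Return the last query entry before position query_index, or None.

    Pairs for which this returns None (Q is the first query entry of the
    session) are discarded, as described above.
    """
    for _, url in reversed(session_entries[:query_index]):
        if is_query_entry(url):
            return url
    return None
```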

3.7 Evaluation

To evaluate the performance of the predictor, accuracy is employed as the measure, and 5-fold cross-validation is performed to validate it.

The accuracy is defined in the following:

$$ {\text{accuracy}} = \frac{\# tp + \# tn}{\# tp + \# fp + \# tn + \# fn} $$

where tp, fp, fn, and tn are defined by the following confusion matrix:

                                Gold standard (labeling result)
                                Triggered         Non-triggered
  Predicting    Triggered       tp                fp
  result        Non-triggered   fn                tn
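
The paper trains its predictor with LIBSVM [11]; as an illustrative stand-in, the sketch below runs 5-fold cross-validation with scikit-learn's SVC (which wraps LIBSVM). The feature file names and the default hyperparameters are assumptions for this example:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# X: one row of feature values per labeled pair (hypothetical files);
# y: 1 = triggered, 0 = non-triggered
X = np.loadtxt("features.txt")
y = np.loadtxt("labels.txt")

clf = SVC(kernel="rbf")  # LIBSVM's default RBF kernel
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores, "mean:", scores.mean())
```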

4 Experimental Results

4.1 Experiment Setting

The distribution of the labeled results is shown in Table 3.

Table 3. Labeled result distribution

To balance the numbers of triggered and non-triggered pairs, we sampled 100 triggered pairs and 100 non-triggered pairs as our data set.

4.2 Results of Basic Feature Engineering Method

Baseline. The baseline is as follows: if the percentage of query terms appearing in the webpage is larger than a threshold T, the pair is classified as triggered; otherwise it is classified as non-triggered. Table 4 reports the baseline accuracy at different thresholds.

Table 4. Baseline result of selected data set at different threshold

With T equal to 20 %, the accuracy peaks at 67.5 %. Thus T is set to 20 %, and the baseline accuracy is 67.5 %. A sketch of this baseline rule, with the same naive tokenization assumption as before, follows.
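
```python
def baseline_predict(query_terms, doc_tokens, threshold=0.20):
    """Predict 'triggered' iff more than `threshold` of the query terms appear."""
    doc_vocab = set(doc_tokens)
    present = sum(1 for t in query_terms if t in doc_vocab)
    return present / len(query_terms) > threshold
```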

Accuracy. Table 5 displays the accuracy of the basic feature engineering method compared with the baseline.

Table 5. Accuracy of basic feature engineering method

This corresponds to a 17.78 % improvement over the baseline, computed with the following formula:

$$ improvement = \frac{method\,accuracy - baseline\,accuracy}{baseline\,accuracy} $$

Feature Effectiveness. To test the effectiveness of each feature, we remove the features one at a time and use the accuracy change (AC) to measure the effect.

$$ Accuracy\,Change\,(F) = \frac{feature\,accuracy - overall\,accuracy}{overall\,accuracy } $$

Here feature accuracy denotes the new accuracy after removing exactly the one feature F under test, and overall accuracy denotes the accuracy with all features (Table 6).

Table 6. Feature effectiveness
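
A leave-one-feature-out loop matching the Accuracy Change definition above could look like this sketch; cross-validated accuracy stands in for the paper's exact evaluation protocol:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def accuracy_changes(X, y, cv=5):
    """AC(F) = (accuracy without F - overall accuracy) / overall accuracy."""
    overall = cross_val_score(SVC(), X, y, cv=cv).mean()
    changes = {}
    for f in range(X.shape[1]):
        X_wo_f = np.delete(X, f, axis=1)  # drop exactly one feature column
        acc = cross_val_score(SVC(), X_wo_f, y, cv=cv).mean()
        changes[f"F{f + 1}"] = (acc - overall) / overall
    return changes
```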

4.3 The Context-Aware Method

Baseline. The baseline rule is the same as before: if the percentage of appearing query terms is larger than the threshold T, the pair is classified as triggered; otherwise as non-triggered. With T equal to 20 %, the best baseline accuracy of 62.86 % is reached, so T = 20 % is selected (Table 7).

Table 7. Baseline result of context-aware method at different threshold

Accuracy. A 33.33 % improvement over the baseline is achieved, along with a 5.42 % improvement over the basic feature engineering method (Table 8).

Table 8. Accuracy of context-aware method

5 Discussions

5.1 Feature Effectiveness in Basic Feature Engineering Method

The aim of this research is to study the relation between a query and its previously viewed webpage, and to predict this relation for an unseen pair. First, the statistics of the labeling results indicate that almost 30 % of U → Q pairs are triggered, a proportion large enough to be worth focusing on. Second, with the 18 features of the basic feature engineering method, there is a significant 17.78 % improvement in accuracy, from 67.5 % to 79.5 %, compared to the baseline.

Most frequency features contribute to the predictor, with the exception of F11. Recall that the assumption of a triggered U → Q pair in this work is that a user viewed U and issued a related Q. Therefore, if Q appears in U, the judges can easily find the relationship in most cases, and F10, the query frequency in the webpage, performs well in the predictor. However, among the frequency features, F9, the query frequency in the URL of the web page, performs best. The reason might be that in some pairs the user issued the domain name as the query while browsing a web page; one possible explanation is that the user wanted to reach the homepage but the hyperlink was hard to find. In addition, several location features (F9, F11, and F13–F16) focus on where the query terms appear in the webpage. F13–F16, designed to capture the frequency at specific locations, perform worse than matching the whole webpage (F10), but F13 and F16 perform better than matching only the body of the webpage (F12). Unexpectedly, F11, the query frequency in the title of the webpage, does not work well; because the title of a webpage is usually short, it is hard to match query terms in it.

The ranking features, originally designed to capture the strength of the relationship between the query and the webpage, turn out to be too sparse to be useful, since only the top 100 results were checked: all other web pages are treated as non-relevant, even those ranked just beyond 100.

The percentage of noun terms in the query is not effective, since most query terms are nouns regardless of whether the query is triggered. Moreover, whether a query is triggered is determined not only by the query but also by the webpage; two pairs with the same query may belong to different types.

The percentage of Wikipedia terms in the query is unfortunately not effective either. Since no advanced word segmentation was applied, a query term may not match the title of any Wikipedia article even though one of its substrings does, as with "NTUCSIE". (This is also why we did not adopt, as a feature, whether the whole query is the title of a Wikipedia article.) Even worse, every term may appear in some Wikipedia title without the occurrences being related to the topic of the query. For example, the log contains the real query "Dr. Randolph Capri", a doctor working in California about whom no Wikipedia article exists; yet "Randolph" and "Capri" both appear in titles of Wikipedia articles, so the percentage is 66.67 %. This feature introduced too much noise.

5.2 Main Topic and Non-main Topic

The triggered category can be divided into two sub-categories: main topic and non-main topic. If the triggered query is the main topic of the web page, the pair belongs to main topic; otherwise it belongs to non-main topic.

Non-main-topic pairs are harder to detect, since the query frequency may be very low and the topics of the triggered queries may be diverse. Sometimes the query does not appear in the webpage at all, yet is still triggered.

In our dataset, the 100 triggered patterns are composed of 90 main topic patterns and 10 non-main topic patterns.

5.3 Context-Aware Method

Given the last preceding query Q', we assume the webpage U is a search result of Q'; the browsing order is Q' → U → Q. With Q', we can correctly classify pairs in which the similarity between U and Q is small but the query is nevertheless triggered. Therefore, the accuracy is improved by providing the previous query as context.

6 Conclusions

In this paper, we proposed a predictor of triggered queries based on a real client-side log. The predictor combines frequency, similarity, webpage length, and rank as features. Among these, the frequency and similarity features outperform the others, and the location features also contribute to the prediction accuracy. The prediction is further enhanced by using previous queries as context. Moreover, many applications could be built on our predictor, such as query suggestion on mobile devices, ad generation for blogs, and implicit hyperlink construction for the web.