
1 Introduction

The Internet has grown to an extent where almost any information a user is looking for is readily available; the challenge is to extract only the most relevant information about a given concept. Be it a search engine or a user, the relevant data needs to be parsed and extracted from the enormous web. Web crawlers, used mostly by search engines, are software programs that discover and index websites [1]. Along with establishing relationships between web pages, crawlers are used to collect and process information that can be used to classify web documents and provide insights into the collected data [2].

The process of crawling and parsing also leads to the extraction of irrelevant information from the web. For example, classical crawlers adopt a breadth-first mechanism [3], i.e., they follow every link of a parent page, extract both relevant and irrelevant data, and hence waste CPU time, memory, and resources. To resolve these challenges, topic-specific crawlers, also known as focused crawlers [4], were introduced. These are better than classical crawlers at producing accurate data for a given concept.

If a user wants to crawl a website, they first need an entry point. In the early days, users had to submit their website to search engines to announce that it was online; now, users can quickly build links to their website and make it visible to search engines. A crawler works with a list of initial links called seed URLs. These are passed to a fetcher that downloads the content of each URL's web page, which is then passed to a link extractor that parses the HTML and extracts all the links present in it. These links are passed to a store processor, which stores them, and to a page filter that sends all the interesting links to a URL-seen module. The URL-seen module checks whether the URL has already been seen in this crawling process; if not, it is sent back to the fetcher. This iterative process continues until the fetcher has no more links whose content it can fetch [5].
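For illustration, this classical pipeline can be sketched in a few lines of Python. The sketch below assumes the requests library for fetching and the standard html.parser for link extraction; the function and class names are ours and do not correspond to any particular crawler's API.

```python
# Minimal sketch of the classical crawl loop (illustrative, not a production crawler).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests


class LinkExtractor(HTMLParser):
    """Collects the href targets of anchor tags, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)            # links waiting to be fetched
    seen = set(seed_urls)                  # URL-seen module: avoids revisits
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()           # breadth-first order
        try:
            html = requests.get(url, timeout=5).text   # fetcher
        except requests.RequestException:
            continue
        pages[url] = html                  # store processor

        extractor = LinkExtractor(url)     # link extractor
        extractor.feed(html)
        for link in extractor.links:       # page filter + URL-seen check
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```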

The retrieval of information is design-dependent and does not always adapt to the user's workflow. A smart crawler, or knowledge-aware crawler, on the other hand, is context-dependent. The majority of re-crawling techniques assume the availability of unlimited resources and zero operating cost, but in reality resources and budget are limited, and it is impossible to crawl every source at every point in time. This brings up the idea of providing the crawler with a mechanism by which it can store its experience/knowledge about web pages and use it when needed. We aim to develop and deploy a knowledge-aware crawler focused on the context needed by the user. This is done by integrating various modules, from maintaining crawled data in an appropriate structure to finding similarities and building a knowledge base. We then contrast it with traditional crawlers to quantify the gains in future crawls.

This paper is organized as follows: Sect. 2 presents the literature survey; Sect. 3 presents the Crawl Smart model design and deliberations; Sect. 4 presents the results and discussion; and Sect. 5 presents the conclusion and future scope.

2 Literature Survey

Internet documents contain the latest and most relevant content, which is essential for constructing an encyclopedia comprising textual data and multimedia data such as images, videos, and audio. With the help of such a dynamic encyclopedia, we can easily crawl through the web pages returned by a search engine and find what is needed without user interaction in filtering and combining data from individual search results. In data science, mining [6] can be the key to finding relevant keywords from millions of web pages without having to read everything.

Irrespective of the industry, annotation tools are the key to automatically indexing data, synthesizing text, or creating a tag cloud using the most representative keywords [7]. With automatic annotation extraction, we can analyze as much data as we need. We could manually read the whole text and define key terms, but this takes a long time [8]. Automating this task allows us to focus on other parts of the project. Extraction of annotations can automate workflows, such as adding tags to web content, saving a lot of time. It also provides actionable, data-driven insights that help make more informed decisions while crawling [9].

Manual procedures to extract statistics from textual information may be challenging for large tasks in which resources are limited, as they usually are. Computer-assisted techniques appear to be a promising alternative, allowing researchers to complete specific tasks with great speed. Every manual, automatic, or semi-automatic technique for analyzing textual information has its own set of benefits and costs that vary depending on the task at hand [10].

Over time, crawlers based on numerous criteria, such as parallel and topical web crawlers, have been designed [11,12,13]. A general evaluation framework for topical crawlers has been discussed [14]. Measures to improve the performance of focused web crawlers have been deliberated [15]. Crawlers used in developing search engines have been surveyed [16]. Studies have been made on different types of web crawlers [17]. The behavior of web crawlers has been modeled [18]. Advanced web crawlers have been discussed [19]. Crawlers have also been designed based on inferences as well as on contextual inferences [20, 21].

Finding helpful information on an extensively distributed internet requires an effective search strategy. The spread and dynamic nature of web resources pose a significant challenge for search engines in keeping their web content index up to date, because they must search the internet periodically [22]. There are many technical challenges in designing a crawler [23]: avoiding frequent visits to the same link, avoiding redundant downloading of web pages and thus utilizing network bandwidth efficiently, avoiding bot traps, maintaining the freshness of the database, bypassing login pages and captchas, crawling non-textual content of HTML pages, etc. Following a broad literature survey, we address the following challenges in this work: avoiding multiple visits to the same link, avoiding redundant downloading of web pages, and knowing whether to crawl a page or not before opening it.

3 Crawl Smart Model

This section presents the design and deliberations of Crawl Smart. Our objective is to design and develop a knowledge-aware crawler that learns the web over time by storing the context of the web pages it visits in the form of knowledge and then using that knowledge to improve crawling in the future.

3.1 Design Principles

The design principles of the crawler are:

  • To design a data structure for contemporary crawler challenges,

  • To annotate crawled data,

  • To establish relationships among the crawled data, and

  • To build a knowledge system with parsed data.

3.2 Model Design

The system model is presented in Fig. 1.

Fig. 1 Crawl smart system model

The crawler first looks for the query tag, given as input, in the knowledge base. If the tag is mapped to a link, that link is used to start crawling; else, a random link belonging to the domain of the input query tag is chosen from the pool of seed links, and crawling is initialized. For every crawled link, the crawler first checks whether any tags are associated with the link in the knowledge base. This step avoids annotating links that were already visited in previous crawls. If tags are available, it extracts them and uses them for the similarity metric; else, it extracts the text from the link and annotates it, and a copy of the resulting tags is inserted into the knowledge base for future reference. If the similarity metric is above the threshold, the link is considered for the next iteration; otherwise, the link is discarded.
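For illustration, this decision loop of Fig. 1 can be sketched as follows. The objects kb, fetch_text, annotate, and similarity are placeholders for the modules detailed in Sects. 3.5 to 3.7, and the seed pool and threshold value are illustrative assumptions rather than fixed parts of the model.

```python
import random


def choose_start_link(query_tag, kb, seed_pool):
    """Start from a link already mapped to the query tag in the knowledge base,
    otherwise from a random seed link for the query's domain."""
    known_links = kb.links_for(query_tag)
    if known_links:
        return next(iter(known_links))
    return random.choice(seed_pool)        # seed_pool: seed links for the query's domain


def should_crawl(link, query_tag, kb, fetch_text, annotate, similarity, threshold=0.5):
    """Decide whether `link` is relevant enough to be considered in the next iteration."""
    tags = kb.tags_for(link)               # reuse tags stored in earlier crawls
    if not tags:
        tags = annotate(fetch_text(link))  # annotate only pages not seen before (Sect. 3.5)
        kb.add(link, tags)                 # keep a copy for future reference
    return similarity(query_tag, tags) >= threshold   # similarity metric (Sect. 3.6)
```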

3.3 Data Design

The data structure designed for this work has two major components and is represented in Fig. 2.

Fig. 2 Data structures to store crawled links: Trie and Hash Table

Trie. It is a rooted tree in which a parent node is linked to one or more nodes, called its child nodes. The root node has no parent node, and leaf nodes have no child nodes. In addition to these properties of a tree, every child node of the Trie has an integer ID that is unique with respect to its parent.

Hash Table. The Hash Table is a map, mapping the node value entering the Trie to a unique key generated by the hash function. The size of the Hash Table is kept dynamic.
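As one possible realization of this data design (a sketch with our own naming, not the exact implementation used in this work), the Trie can be represented by parent-linked nodes whose IDs are their positions among their siblings, and the Hash Table by a Python dict, which resizes dynamically:

```python
class TrieNode:
    """Node of the crawl Trie: every child gets an integer ID unique among its siblings."""

    def __init__(self, value, parent=None):
        self.value = value          # e.g., the crawled URL stored at this node
        self.parent = parent        # None for the root node
        self.children = []          # the ID of a child is its index in this list

    def add_child(self, value):
        child = TrieNode(value, parent=self)
        self.children.append(child)          # ID of the new child = len(children) - 1
        return child


# The Hash Table maps each value inserted into the Trie to its unique hash key.
# A Python dict grows as needed, so its size stays dynamic.
hash_table = {}
```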

3.4 Abstract Data Type Representation

The Abstract Data Type (ADT) representation of the data structure is shown in Fig. 3.

Fig. 3 Abstract data type representation

The objective of this work is to design a data structure that can efficiently store the relationships between the crawled data. The web can be seen as a hierarchy of tens of millions of web pages, resembling the Trie data structure, which makes the Trie suitable for storing crawled data along with their relationships. Also, by the concept of hashing, we can make sure that no link is inserted more than once into the Trie. The two components of the data structure interact with each other to uniquely identify a particular value placed in the Trie using a hash key. The hash function, which generates a unique hash key for every input, is defined recursively: the hash key of a child node is obtained by appending a delimiter and the child's ID (unique with respect to its parent) to the hash key of its parent node.

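A minimal sketch of this recursive key generation is given below; the concrete delimiter and the base value assigned to the root are our assumptions and can be chosen freely as long as the delimiter does not occur in the IDs.

```python
DELIMITER = "/"
ROOT_KEY = "0"       # base value for the root node (our assumption)


def child_key(parent_key, child_id):
    """hash(child) = hash(parent) + delimiter + child ID, as used in Theorem 1."""
    return parent_key + DELIMITER + str(child_id)


# Two children of the same parent receive different keys (Case 2 of the proof),
# and a child's key always differs from its parent's key (Case 1).
k1 = child_key(ROOT_KEY, 1)      # "0/1"
k2 = child_key(ROOT_KEY, 2)      # "0/2"
assert k1 != k2 and k1 != ROOT_KEY
```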

Theorem 1

The hash function hash(node) generates a unique hash value for every node.

Proof

Every node in the Trie is either the root node or a child node, but not both. Consider two distinct nodes x and y in a Trie with at least two nodes.

Case 1 x and y are directly connected

If x is a parent of y, then

hash(y) = hash(x) + delimiter + someID ⇒ hash(y) ≠ hash(x)

If y is a parent of x, then

hash(x) = hash(y) + delimiter + someID ⇒ hash(y) ≠ hash(x)

Thus, hash(x) ≠ hash(y)

Case 2 x and y are siblings, i.e., they have a common parent

Let z be the parent of x and y. Now,

hash(x) = hash(z) + delimiter + ID1

hash(y) = hash(z) + delimiter + ID2

Since ID1 ≠ ID2, hash(x) ≠ hash(y)

Case 3 x and y are not related to each other

Let x1 be the parent of x and y1 be the parent of y.

  • If x1 and y1 are directly connected, then hash(x1) ≠ hash(y1), by case 1.

    hash(x1) + delimiter + ID1 ≠ hash(y1) + delimiter + ID2

    hash(x) ≠ hash(y)

  • If x1 and y1 are siblings, i.e., if they have a common parent

    hash(x1) ≠ hash(y1), by case 2.

    hash(x1) + delimiter + ID1 ≠ hash(y1) + delimiter + ID2

    hash(x) ≠ hash(y)

  • If x1 and y1 are not related to each other

    hash(x1) ≠ hash(y1) (by applying the same argument recursively until it terminates in Case 1 or Case 2)

    hash(x1) + delimiter + ID1 ≠ hash(y1) + delimiter + ID2

    Thus, hash(x) ≠ hash(y) in this case as well.

Thus, in all cases, the hash function hash(node) generates a unique hash value for every node.

3.5 Annotations

Data annotation is similar to tagging: it allows users to organize information by associating it with tags or keywords. We annotate the textual content of a web page to produce metadata that briefly explains the context of the page. When the same web page is encountered again, these annotations tell us the context of the page without parsing the entire page once more. Since more than 80% of the textual data on the web is unstructured, i.e., not categorized in a predetermined way, it is extremely difficult to analyze and process. Hence, the crawler extracts keywords from the page to analyze its context more effectively. Our annotation module is shown in Fig. 4. The module uses the Term Frequency–Inverse Document Frequency (TF-IDF) model, which comes from language modeling theory; TF-IDF is presented in Eq. (1).

$$W_{ij} = t_{ij} \cdot \log \left( N / \mathrm{df}_{ij} \right)$$
(1)
Fig. 4 Annotation module

Here, \(t_{ij}\) is the term frequency score of tag \(i\) in document \(j\), \(\mathrm{df}_{ij}\) is the corresponding document frequency score, \(N\) is the total number of documents considered, and \(W_{ij}\) is the weight of the tag in the text. Although a tag can consist of multiple words, and language modeling offers techniques to analyze such tags, they are heavily time- and resource-consuming and hence out of the scope of this work.

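A minimal sketch of this annotation step is shown below; it computes the weight of Eq. (1) for single-word tags. The simple tokenization and the choice of returning the five highest-weighted tags are our simplifications.

```python
import math
import re
from collections import Counter


def annotate(documents, doc_index, top_k=5):
    """Return the top-k tags of documents[doc_index] by W_ij = t_ij * log(N / df_ij)."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    n_docs = len(documents)                      # N: number of documents considered

    df = Counter()                               # document frequency of every term
    for tokens in tokenized:
        df.update(set(tokens))                   # count each term once per document

    tf = Counter(tokenized[doc_index])           # term frequency in the target page
    weights = {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:top_k]]
```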

3.6 Similarity Module

The similarity module is presented in Fig. 5. The major components of this module are WordNet, spaCy, and the knowledge base, and the module builds on components from Natural Language Processing. The input to the system is a tag provided by the user. This tag is matched against the annotations present in the knowledge base: if the match crosses the threshold, the corresponding link in the knowledge base is returned as the result; if no annotation in the knowledge base matches the entered tag, a new crawl is initiated with the input query as the tag.

Fig. 5 Similarity module

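A sketch of such a similarity check is given below, assuming spaCy's en_core_web_md model (which ships with word vectors) and WordNet accessed through NLTK; averaging the two scores, taking the best match over a page's tags, and the threshold value are our assumptions rather than fixed parts of the module.

```python
# Assumes the dependencies are installed and the spaCy model (en_core_web_md)
# and the NLTK WordNet corpus (nltk.download("wordnet")) have been downloaded.
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_md")


def wordnet_similarity(word_a, word_b):
    """Best Wu-Palmer similarity over the synsets of the two words (0 if none found)."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(word_a)
              for s2 in wn.synsets(word_b)]
    return max(scores, default=0.0)


def tag_similarity(query_tag, page_tags):
    """Similarity of the query tag to the closest tag annotating a page."""
    query_doc = nlp(query_tag)
    best = 0.0
    for tag in page_tags:
        vec_score = query_doc.similarity(nlp(tag))       # spaCy vector similarity
        wn_score = wordnet_similarity(query_tag, tag)    # WordNet-based similarity
        best = max(best, (vec_score + wn_score) / 2)     # simple average of both scores
    return best


def is_relevant(query_tag, page_tags, threshold=0.6):
    """A link is kept for the next iteration only if the score crosses the threshold."""
    return tag_similarity(query_tag, page_tags) >= threshold
```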

3.7 Knowledge Base

The crawler has a knowledge base in which all previously crawled links, along with their annotations (tags), are maintained. Since the user provides the input query as a word, the crawler looks in the knowledge base for links with tags matching the input query; so the tags should be mapped to their respective links. During crawling, the crawler looks up the tags of a particular link in the knowledge base; so the links should also be mapped to their respective tags. Moreover, a link can have more than one tag and needs to be mapped to each of them, and similarly a tag can be part of one or more links and should be mapped to each of them. Thus, the knowledge base is a bidirectional multi-mapping. Its internal storage is presented in Fig. 6.

Fig. 6 The knowledge base
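A minimal in-memory sketch of this bidirectional multi-mapping is shown below; persisting it to disk between crawls is left out for brevity.

```python
from collections import defaultdict


class KnowledgeBase:
    """Bidirectional multi-mapping between crawled links and their tags (Fig. 6):
    a link can carry many tags, and a tag can belong to many links."""

    def __init__(self):
        self.link_to_tags = defaultdict(set)
        self.tag_to_links = defaultdict(set)

    def add(self, link, tags):
        for tag in tags:
            self.link_to_tags[link].add(tag)
            self.tag_to_links[tag].add(link)

    def tags_for(self, link):
        """Tags of a previously crawled link (empty set if never seen)."""
        return self.link_to_tags.get(link, set())

    def links_for(self, tag):
        """Links whose annotations contain the given tag."""
        return self.tag_to_links.get(tag, set())
```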

4 Results and Discussion

The input to the system is a keyword given as a query by the user. The system starts with a link, also known as a seed link, that is in the context of the input query, and crawls all further links using the algorithm described in the previous sections. Adhering to the system capacity, power constraints, and scope, we scale down the input tests and relax the performance metrics under the following assumption: the user makes no spelling errors in the input query. The results of the traditional crawler and our crawler are tabulated in Table 1. We used the website GeeksforGeeks to tabulate the results, as the site allows bots to visit the respective pages (per the rules stated as of 01 December 2021). The system stops crawling once it has crawled at most 10 links.

Table 1 Results for the input query “sorting”

Five more queries were considered in this test, on the smart crawler only. Two successive iterations were conducted on the same set of queries to compare the performance of the crawler in both the absence and the presence of links in the knowledge base. The results obtained are plotted in Figs. 7 and 8, and the time comparison can be seen in Fig. 9.

Fig. 7 Iteration 01: crawler in the absence of knowledge base

Fig. 8 Iteration 02: crawler in the presence of knowledge base

Fig. 9 Comparison of time required

Observations: The smart crawler was able to fetch more relevant links than the traditional crawler for the same set of queries. Moreover, the smart crawler took an average crawling time of (67.96 \(\pm\) 60.092) seconds to execute a query in the first iteration, and an average of (2.3 \(\pm\) 1.4) seconds in the second iteration. The difference in the average time taken to crawl the links for a particular query indicates that having the context of a web page already in the knowledge base successfully reduces crawling time by sparing the crawler from going through the content of the page again.

5 Conclusion and Future Scope

The objective of this work was to design a crawler that is aware of a link or URL and can decide whether or not to crawl it without requesting or opening the link's HTML page. This has been achieved through the knowledge base, which acts as a memory for the crawler. Regardless of whether a newly opened and annotated link is used in the current crawling process, it is added to the knowledge base. This saves the time spent requesting, downloading, and annotating a web page that has already been downloaded in the past. Thus, the crawler learns the web gradually over time and improves its performance as more and more links are added to the knowledge base, similar to a machine learning model.