1 Introduction

The World Wide Web (WWW) is the largest repository of digital images in the world. The number of images available on the Web is growing exponentially and will continue to increase. However, compared with text, annotating images with the semantics they depict is much more complicated. Humans can readily recognize the objects depicted in images, but in computer vision, automatically understanding the semantics of images is still a challenging task. Image annotation can be done through either content-based or text-based approaches. In the text-based approach, different parts of a Web page are considered as possible sources of contextual information for images, namely image file names (ImgSrc), page title, anchor texts, alternative text (ALT attribute), and image surrounding text. In the content-based approach, image processing techniques are used to describe the content of a Web image in terms of visual features such as texture, shape, and color.

Most image search engines index images using the text associated with them, i.e., on the basis of alt text and image captions. The alt attribute provides a textual alternative to non-textual content in Web pages, such as images, video, and other media. It gives a semantic meaning and description to the embedded images. However, the Web is still replete with images whose associated text is missing, incorrect, or poor. In fact, in many cases images are given only an empty or null alt attribute (alt = “ ”), so such images remain inaccessible. Image search engines that annotate Web images using content-based annotation suffer from scalability problems.

In this work, a novel approach for extracting pertinent keywords for Web image annotation using semantic distance and Euclidean distance is proposed. Further, this work proposes an algorithm that automatically crawls Web pages and extracts contextual information from the pages that contain valid images. Each Web page is segmented into Web content blocks, and the semantic correlation between a Web image and each Web content block is then calculated using a semantic distance measure. The pertinent keywords from the contextual information, along with semantically similar content, are then used for annotating Web images. Thereafter, each image is indexed with its associated text.

This paper is organized as follows: Sect. 2 discusses related work in this domain. Section 3 presents the architecture of the proposed system. Section 4 describes the algorithm for this approach. Finally, Sect. 5 presents the conclusion.

2 Related Work

A number of text-based approaches for Web image annotation have been proposed in recent years [1]. Numerous systems [2,3,4,5,6] use contextual information for annotating Web images. Methods for extracting contextual information are (i) window-based extraction [7, 8], (ii) structure-based wrappers [9, 10], and (iii) Web page segmentation [11,12,13].

Window-based extraction is a heuristic approach that extracts the text surrounding an image; it yields poor results because irrelevant data is sometimes extracted and relevant data discarded. Structure-based wrappers use the structural information of a Web page to decide the borders of the image context, but they are not adaptive because they are designed for specific Web page design patterns. Web page segmentation adapts to different Web page styles: it divides the Web page into segments of common topics, and each image is then associated with the textual content of the segment to which it belongs. However, it remains difficult to determine how the semantics of the text relate to the image.

In this work, each Web page is segmented into Web content blocks using the vision-based page segmentation (VIPS) algorithm [12]. Thereafter, the semantic similarity between a Web image and each Web content block is calculated using a semantic distance measure. Semantic distance is the inverse of semantic similarity [14]; that is, the smaller the distance between two concepts, the more similar they are. Accordingly, the two notions are used interchangeably in this work, with a small distance corresponding to a high similarity.

Semantic distance between Web content blocks is calculated by determining a common representation for them. Generally, text is used as the common representation. According to the literature, there are various similarity metrics for texts [13, 15, 16]. Some simple metrics are based on lexical matching. Such approaches are successful only to some extent, as they do not identify the semantic similarity of texts. For instance, the terms Plant and Tree have a high semantic correlation, which remains unnoticed without background knowledge. To overcome this, the use of the WordNet taxonomy as background knowledge has been discussed [17, 18].

In this work, the word-to-word similarity metric [19] is used to calculate the similarity between words and text-to-text similarity is calculated using the metric introduced by Corley [20].

3 Proposed Architecture

The architecture of the proposed system is given in Fig. 1. The components of the proposed system are discussed in the following subsections.

Fig. 1 Proposed architecture

3.1 Crawl Manager

The crawl manager is a program that takes a seed URL from the URL queue and fetches the corresponding Web page from the WWW.

3.2 URL Queue

The URL queue is a repository that stores the list of URLs discovered and extracted by the crawler.
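As an illustration, the interplay between the crawl manager and the URL queue can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the system's implementation: the function name `crawl`, the breadth-first (FIFO) order, the de-duplication of URLs, the `max_pages` limit, and the injected `fetch` and `extract_links` callables are all hypothetical choices that the text above does not specify.

```python
from collections import deque
from typing import Callable, Iterable, List, Set


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], str],
          extract_links: Callable[[str], List[str]],
          max_pages: int = 100) -> List[str]:
    """Return the URLs in the order in which they were fetched."""
    queue = deque(seed_urls)       # URL queue, assumed to be first-in first-out
    seen: Set[str] = set(queue)    # assumption: each URL is queued at most once
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()      # crawl manager takes the next URL
        page = fetch(url)          # ... and fetches the corresponding page
        visited.append(url)
        for link in extract_links(page):
            if link not in seen:   # newly discovered URLs go back into the queue
                seen.add(link)
                queue.append(link)
    return visited
```

Passing `fetch` and `extract_links` as parameters keeps the sketch independent of any particular HTTP or HTML library, so the queue logic can be exercised on an in-memory link graph.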

3.3 Parser

The parser is used to extract the information present on Web pages. It downloads the Web page and extracts its XML file. Thereafter, it converts the XML file into DOM object models. It then checks whether valid images are present on the Web page. If a valid image is present, the page is segmented using the visual Web page segmenter; otherwise, the next URL is crawled. The DOM object models that contain the page title, the image source, and the alternative text of the valid images are extracted from the set of object models of the Web page.
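The extraction of the page title, image source, and alternative text can be illustrated with a short sketch. It assumes Python's standard `html.parser` module and works on parser events rather than on a DOM tree, so it is a simplification of the parser described above. The names `ImageMetadataParser` and `parse_page` are hypothetical, and the validity check for images is left out.

```python
from html.parser import HTMLParser


class ImageMetadataParser(HTMLParser):
    """Collects the page title and the src/alt attributes of every <img>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = []            # one {"src": ..., "alt": ...} dict per image
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            attr = dict(attrs)
            # A missing or valueless attribute is stored as an empty string.
            self.images.append({"src": attr.get("src") or "",
                                "alt": attr.get("alt") or ""})

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html: str):
    """Return (page title, list of image metadata dicts) for an HTML string."""
    parser = ImageMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.images
```

Images with an empty `alt` value are kept in the result, since such images are exactly the ones for which the surrounding text blocks have to supply the annotation.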

3.4 Visual Web Page Segmenter

The visual Web page segmenter is used to segment Web pages into Web content blocks. Segmentation of a Web page means dividing the page according to certain rules or procedures to obtain multiple semantically different Web content blocks whose content can be investigated further.

In the proposed approach, the VIPS algorithm [12] is used to segment a Web page into Web content blocks. It extracts the semantic structure of a Web page based on its visual representation. The segmentation process has three steps: block extraction, separator detection, and content structure construction. Blocks are extracted from the DOM tree structure of the Web page by using the page layout structure, and separators are then located among these blocks. The vision-based content structure of a page is obtained by combining the DOM structure and the visual cues. A Web page is thus represented as a collection of Web content blocks that have a similar degree of coherence (DOC). With the permitted DOC (pDOC) set to its maximum value, a set of Web content blocks that consist of visually indivisible contents is obtained. The algorithm also provides the two-dimensional Cartesian coordinates of each visual block, based on its location on the Web page.

3.5 Block Analyzer

The block analyzer analyzes the Web content blocks obtained from segmentation and divides them into two categories: image blocks and text blocks. Blocks that contain images are considered image blocks, and the rest are considered text blocks.

3.6 Nearest Text Block Detector

The nearest text block detector detects the text blocks nearest to an image block. To check closeness, the Euclidean distance between the closest edges of two blocks is calculated. The distance between the two edges is obtained by using Eq. (1):

$$ {\text{Euclidean}}\;{\text{Distance }} = \sqrt {(x_{2} - x_{1} )^{2} + (y_{2} - y_{1} )^{2} } $$
(1)

After the distance between each image block and text block pair is calculated, the text blocks whose distance from the image block is below the threshold are assigned to that image block. In this way, each image block is assigned a group of text blocks that are close to it.
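A minimal sketch of the closest-edge distance and the threshold test follows. It assumes that each block is an axis-aligned rectangle given as (left, top, right, bottom) with left ≤ right and top ≤ bottom, and that overlapping blocks have distance zero. The names `Box`, `closest_edge_distance`, and `nearest_text_blocks` are hypothetical, and the threshold value is left to the caller because the text does not fix one.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # assumed layout: (left, top, right, bottom)


def closest_edge_distance(a: Box, b: Box) -> float:
    """Euclidean distance (Eq. 1) between the closest edges of two rectangles."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)   # horizontal gap, 0 if they overlap in x
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)   # vertical gap, 0 if they overlap in y
    return math.hypot(dx, dy)


def nearest_text_blocks(image_box: Box,
                        text_boxes: Dict[str, Box],
                        threshold: float) -> List[str]:
    """Ids of the text blocks whose distance to the image block is below the threshold."""
    return [block_id for block_id, box in text_boxes.items()
            if closest_edge_distance(image_box, box) < threshold]
```

For example, an image block at (0, 0, 100, 100) and a text block at (103, 104, 200, 200) are separated by a horizontal gap of 3 and a vertical gap of 4, which gives a distance of 5.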

3.7 Tag Text Extractor

In the proposed approach, the tag text extractor is used to extract text from the HTML tags. The parser provides the DOM object models by parsing a Web page. If an image on the Web page is valid, i.e., it is not a button or an icon (as checked by the valid image checker), text is extracted from the image metadata, such as the image source (ImgSrc) and the alternative text (ALT). The page title of the Web page that contains the image is also extracted.

3.8 Keyword Extractor

In this work, the keyword extractor is used to extract keywords from the image metadata and the page title. The keywords are stored in a text file, which is later used to find semantically close text blocks by calculating semantic distance.
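One possible way to turn the image source, alternative text, and page title into keywords is sketched below. The tokenization rule, the small stop-word list, and the function name `metadata_keywords` are assumptions made for illustration; the text does not specify how the keyword extractor works internally, and writing the result to a text file is omitted.

```python
import posixpath
import re
from typing import List
from urllib.parse import urlparse

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "for", "to",
              "img", "image"}


def metadata_keywords(img_src: str, alt_text: str, page_title: str) -> List[str]:
    """Distinct lower-case keywords from the file name, alt text, and page title."""
    path = urlparse(img_src).path
    file_stem = posixpath.splitext(posixpath.basename(path))[0]   # drop directory and extension
    text = " ".join([file_stem, alt_text, page_title]).lower()
    keywords: List[str] = []
    for token in re.findall(r"[a-z]+", text):   # digits, "_" and "-" act as separators
        if len(token) > 1 and token not in STOP_WORDS and token not in keywords:
            keywords.append(token)
    return keywords
```

Splitting the file name on non-letter characters lets a source such as `oak_tree-01.jpg` contribute the words "oak" and "tree" even when the alternative text is empty.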

3.9 Semantic Distance Calculator

The semantic distance calculator is used to determine the semantic correlation among the Web content blocks. Because lexical matching between words does not give good results, the words are instead matched with concepts in a knowledge base, and concept-to-concept matching is computed using WordNet.

Before the text similarity between image metadata and text blocks is computed, the text blocks are preprocessed. After preprocessing, sentence detection is performed, followed by tokenization and part-of-speech tagging of all words using NLP techniques. Finally, the terms are stemmed and mapped to WordNet concepts.

The similarity of text is calculated using Corley’s approach. In this method, for every noun (verb) that belongs to image metadata, the noun (verb) in the text of text blocks with maximum semantic similarity is identified according to Eq. 2.

$$ {\text{sim}}_{\text{Lin}} = \frac{{2.{\text{IC}}\left( {\text{LCS}} \right)}}{{{\text{IC}}\left( {{\text{Concept}}_{1} } \right) + {\text{IC}}\left( {{\text{Concept}}_{2} } \right)}} $$
(2)

Here LCS is the least common subsumer of the two concepts in the WordNet taxonomy, and IC is the information content, which measures the specificity of a concept as follows:

$$ {\text{IC}}({\text{concept}}) = - \log P({\text{concept}}) $$
(3)
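As a numerical illustration of Eqs. 2 and 3, the following sketch computes the Lin similarity directly from concept probabilities. The function names are hypothetical, the probabilities used in the example are invented rather than taken from a corpus, and the natural logarithm is an assumption (the base cancels in Eq. 2). Looking up the LCS in WordNet is not shown.

```python
import math


def information_content(probability: float) -> float:
    """Eq. 3: IC(concept) = -log P(concept), for 0 < P(concept) <= 1."""
    return -math.log(probability)


def lin_similarity(p_concept1: float, p_concept2: float, p_lcs: float) -> float:
    """Eq. 2: 2 * IC(LCS) / (IC(concept1) + IC(concept2))."""
    denominator = information_content(p_concept1) + information_content(p_concept2)
    if denominator == 0.0:
        # Assumed convention: two concepts with probability 1 carry no
        # information, so the similarity is reported as 0 instead of dividing by 0.
        return 0.0
    return 2.0 * information_content(p_lcs) / denominator


# Invented example: two concepts with probability 0.001 each whose least
# common subsumer has probability 0.01.
example_similarity = lin_similarity(0.001, 0.001, 0.01)
```

With these invented values, IC(LCS) = log 100 and IC of each concept = log 1000, so the similarity is 2/3. A more general (more probable) LCS lowers the score, and two identical concepts score 1.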

In Eq. 3, P(concept) is the probability of occurrence of an instance of the concept in a large corpus. For word classes other than nouns and verbs, lexical matching is performed. The similarity between two texts T1 (text of image metadata) and T2 (text of text blocks) is calculated as:

$$ {\text{sim}}\left( {T_{1} ,T_{2} } \right)_{{T_{1} }} = \frac{{\sum\nolimits_{{w_{i} \in T_{1} }} {{\text{maxSim}}\left( {w_{i} ,T_{2} } \right).idf\left( {w_{i} } \right)} }}{{\sum\nolimits_{{w_{i} \in T_{1} }} {idf\left( {w_{i} } \right)} }} $$
(4)

where \( idf(w_{i}) \) is the inverse document frequency [19] of the word \( w_{i} \) in a large corpus. Equation 4 gives the directional similarity score with respect to T1; the corresponding score with respect to T2 is calculated in the same way. The scores from both directions are combined into a bidirectional similarity as given in Eq. 5:

$$ {\text{sim}}\left( {T_{1} ,T_{2} } \right) = {\text{sim}}\left( {T_{1} ,T_{2} } \right)_{{T_{1} }} .{\text{sim}}\left( {T_{1} ,T_{2} } \right)_{{T_{2} }} $$
(5)

This similarity score has a value between 0 and 1. From this similarity score, semantic distance is calculated as follows:

$$ {\text{dist}}_{\text{sem}} \left( {T_{1} ,T_{2} } \right) = 1 - {\text{sim}}\left( {T_{1} ,T_{2} } \right) $$
(6)

In this way, the semantic distance between an image block and each of its nearest text blocks is calculated. The text block with the smallest semantic distance is taken as the text block semantically correlated with that image block.
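Equations 4 to 6 can be sketched as follows. This is a sketch under stated assumptions rather than the implementation used in this work: the word-to-word similarity is passed in as a function `word_sim` (in the described system it would be the WordNet-based measure of Eq. 2), words without an idf entry receive a weight of 1.0, and an empty text yields a similarity of 0. All function names are hypothetical.

```python
from typing import Callable, Dict, Sequence

WordSim = Callable[[str, str], float]


def directional_similarity(t1: Sequence[str], t2: Sequence[str],
                           word_sim: WordSim, idf: Dict[str, float]) -> float:
    """Eq. 4: idf-weighted mean of the best match in t2 for each word of t1."""
    numerator = denominator = 0.0
    for w in t1:
        weight = idf.get(w, 1.0)                               # assumed default weight
        best = max((word_sim(w, v) for v in t2), default=0.0)  # maxSim(w, T2)
        numerator += best * weight
        denominator += weight
    return numerator / denominator if denominator else 0.0


def bidirectional_similarity(t1: Sequence[str], t2: Sequence[str],
                             word_sim: WordSim, idf: Dict[str, float]) -> float:
    """Eq. 5: product of the two directional scores."""
    return (directional_similarity(t1, t2, word_sim, idf)
            * directional_similarity(t2, t1, word_sim, idf))


def semantic_distance(t1: Sequence[str], t2: Sequence[str],
                      word_sim: WordSim, idf: Dict[str, float]) -> float:
    """Eq. 6: 1 - sim(T1, T2)."""
    return 1.0 - bidirectional_similarity(t1, t2, word_sim, idf)
```

As a small worked case with invented scores, let T1 = (oak, tree) and T2 = (tree, plant, garden), let identical words score 1, let Plant and Tree score 0.8, and let all idf weights be 1. The directional scores are then 0.5 and 0.6, the bidirectional similarity is 0.3, and the semantic distance is 0.7.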

3.10 Text Extractor

The text extractor is used to extract text from the text blocks present on the Web page. The text of the semantically close text block obtained in the previous step is extracted and buffered. This text, along with the text extracted from the image metadata and the page title, is used to extract frequent keywords.

3.11 Keyword Determiner

In this work, the keyword determiner is used to extract keywords from the text stored in the buffer. Frequent keywords are determined by applying a threshold to the frequency count of the keywords. Keywords whose frequency is above the threshold are extracted and used for annotating images.
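The frequency thresholding step can be sketched with a counter. The tokenization rule, the optional stop-word set, the ordering of the result, and the function name `frequent_keywords` are assumptions; only the rule that keywords with a frequency above the threshold are kept comes from the description above.

```python
import re
from collections import Counter
from typing import FrozenSet, List


def frequent_keywords(text: str, threshold: int,
                      stop_words: FrozenSet[str] = frozenset()) -> List[str]:
    """Words that occur more than `threshold` times, most frequent first."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in stop_words]
    counts = Counter(tokens)
    frequent = [word for word, count in counts.items() if count > threshold]
    # Sort by descending frequency, then alphabetically, for a stable order.
    return sorted(frequent, key=lambda word: (-counts[word], word))
```

For the buffered text "Oak tree. The oak tree grows; oak leaves fall." with a threshold of 1 and "the" as a stop word, "oak" occurs three times and "tree" twice, so both are kept and all other words are dropped.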

3.12 Image Annotation

The page title of the Web page, the image source, the alternative text of the image, and the frequent keywords extracted in the previous step together describe the image best and are used to annotate it.

4 Algorithm

The algorithm for the proposed system is Automatic_Image_Annotation. It takes the URL of a Web page as input and provides descriptions of the images on that page as output.

This algorithm is used for annotating the images present on a Web page. First, the page is parsed to extract the page title, Img_Src, and Alt text of each image. Second, Web page segmentation is performed using the VIPS algorithm. The validity of each image is then checked, and for valid images the nearest text blocks are found using the algorithm given below. For each text block in this list, the semantic distance is calculated using the bidirectional similarity between blocks. Keywords are then extracted from the semantically closest text block and used for image annotation.

Automatic_Image_Annotation (Description of image)
Begin
    Parse the Web page (URL)
    If contain valid image
        \( {Text}_{1} \) = Extract Page_Title, Img_Src (valid image), Alt (valid image)
        Web_Page_Segmentation (URL)
        For each valid image_block
            Text_block_list = Find_Nearest_Text_Block (Image Block Cartesian Coordinates, Text Blocks Cartesian Coordinates)
            least_distance = some_big_number
            For each text block in Text_block_list
                distance = Find_semantic_distance (Text Block, \( {Text}_{1} \))
                If (least_distance > distance)
                {
                    least_distance = distance
                    \( {id}_{text} \) = id of text block
                }
            End
            Extract keywords from text block (\( {id}_{text} \))
        End
    End
End
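The inner selection loop of Automatic_Image_Annotation, which keeps the id of the text block with the least semantic distance, can be sketched as follows. The function `most_related_block` and its parameters are hypothetical. The `jaccard_distance` function is only a stand-in used to make the sketch runnable; it is a lexical measure and is not the WordNet-based semantic distance of Sect. 3.9.

```python
import math
from typing import Callable, Dict, Optional


def most_related_block(metadata_text: str,
                       candidate_blocks: Dict[str, str],
                       distance: Callable[[str, str], float]) -> Optional[str]:
    """Id of the candidate text block with the smallest distance, or None."""
    best_id, least_distance = None, math.inf     # "some_big_number"
    for block_id, block_text in candidate_blocks.items():
        d = distance(block_text, metadata_text)
        if d < least_distance:                   # keep the closest block so far
            least_distance, best_id = d, block_id
    return best_id


def jaccard_distance(a: str, b: str) -> float:
    """Stand-in distance: 1 - |A ∩ B| / |A ∪ B| over lower-cased word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return 1.0 - len(words_a & words_b) / len(union) if union else 0.0
```

The loop scans all candidates before it reports a result, so the block that is finally selected is the one with the minimum distance rather than the first block that improves on the initial value.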

The algorithm for obtaining the nearest text blocks is Find_Nearest_Text_Block. It takes image blocks and text blocks as input and provides a list of the nearest text blocks as output. The algorithm collects the text blocks nearest to an image block on the Web page using the closest-edge Euclidean distance between Web content blocks, which is calculated from the Cartesian coordinates of the blocks.

Find_Nearest_Text_Block (List of Nearest Text Blocks)
Begin
    For each Text Block
    {
        Distance = calculate Euclidean distance between image block and text block
        If (Distance < threshold)
        {
            Put the id of text block in a list
        }
    }
End

5 Conclusion

This paper presents an algorithm for a novel approach to extracting pertinent keywords for Web image annotation using semantics. In this work, Web images are automatically annotated by determining pertinent keywords from the contextual information of the Web page and from semantically similar content in Web content blocks. The approach is expected to provide better results than the method of image indexing using Web page segmentation and clustering [21], because in that method the context of the image is not coordinated with the context of the surrounding text. In the proposed approach, by contrast, the closeness between an image and the Web content blocks is computed using both Euclidean distance and semantic distance.