
1 Introduction

The extensive size of Big Data makes it impossible for Internet users to find all relevant information or products without Web search engines or recommender systems. Even so, Web users have no guarantee that the results provided by these search applications are either exhaustive or relevant to their search needs. Businesses have a commercial interest in ranking higher in results or recommendations to attract more customers, while Web search engines and recommender systems make their profit from advertisements and product purchases. The main consequence is that irrelevant results or products may be shown in top positions and relevant ones “hidden” at the very bottom of the search list. As the Internet and Big Data continue to expand, Web users become increasingly dependent on information filtering applications.

We describe the application of neural networks in recommender systems in Sect. 2. To address the search issues presented above, Sect. 3 proposes an Intelligent Internet Search Assistant (ISA) that acts as an interface between an individual user’s query and the different search engines. We validate our ISA against other Web search engines and metasearch engines, online databases and recommender systems in Sect. 4. Our conclusions are presented in Sect. 5.

2 Related Work

The ability of neural networks to learn iteratively from different inputs to produce the desired outputs, as a mechanism of adaptation to users’ interests that provides relevant answers, has already been applied to the World Wide Web and recommender systems. S. Patil et al. [1] propose a recommender system for Web based marketing that uses a collaborative filtering mechanism with a k-separability approach. They build a model for each user in several steps: they cluster groups of individuals into different categories according to their similarity using Adaptive Resonance Theory (ART) and then calculate the Singular Value Decomposition matrix. M. Lee et al. [2] propose a new recommender system that combines collaborative filtering with a Self-Organizing Map neural network. They segment all users by demographic characteristics, and users in each segment are clustered according to their item preferences using the neural network. C. Vassiliou et al. [3] propose a framework that combines neural networks and collaborative filtering. Their approach uses a neural network to recognize implicit patterns between user profiles and items of interest, which are then further refined by collaborative filtering to produce personalized suggestions. K. Kongsakun et al. [4] develop an intelligent recommender system framework based on an investigation of the possible correlations between students’ historic records and final results. C. Chang et al. [5] train artificial neural networks to group users into different types. They use an Adaptive Resonance Theory (ART) neural network in an unsupervised learning model where the input layer is a vector of user features and the output layer represents the different clusters. P. Chou et al. [6] integrate a back propagation neural network with supervised learning and a feed forward architecture in an “interior desire system”. D. Billsus et al. [7] propose a representation for collaborative filtering tasks that allows the application of any machine learning algorithm, including a feed forward neural network with k input neurons, 2 hidden neurons and 1 output neuron. M. Krstic et al. [8] apply a single hidden layer feed forward neural network as a classifier that estimates whether a certain TV programme is relevant to the user based on the programme description, contextual data and the feedback provided by the user. C. Biancalana et al. [9] propose a neural network to include contextual information in film recommendations; the aim of the network is to identify which member of a household gave a specific rating to a film at a specific time. M. Devi et al. [10] use a probabilistic neural network to calculate the rating between users based on the rating matrix; they smooth the sparse rating matrix by predicting the rating values of the unrated items.

3 The Intelligent Internet Search Assistant Model

The search assistant we design is based on the Random Neural Network (RNN) [11-13]. This is a spiking recurrent stochastic model for neural networks. Its main analytical properties are its “product form” and the existence of a unique network steady state solution. It represents more closely how signals are transmitted in many biological neural networks, where they actually travel as spikes or impulses rather than as analogue signal levels, and it has been used in different applications including network routing with cognitive packet networks using reinforcement learning, which requires the search for paths that meet certain pre-specified quality of service requirements [14], the search for exit routes for evacuees in emergency situations [15, 16], network routing [17], pattern based search for specific objects [18], video compression [19], image texture learning and generation [20] and Deep Learning [21].
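To make the model concrete, the sketch below iterates the standard RNN steady-state equations to their fixed point for a small example network; the weights and external spike rates are arbitrary illustrative values, not parameters of our ISA.

```python
import numpy as np

def rnn_steady_state(W_plus, W_minus, Lambda, lam, iters=200):
    """Fixed-point iteration of the RNN steady-state equations.

    W_plus[j, i]  : excitatory weight from neuron j to neuron i
    W_minus[j, i] : inhibitory weight from neuron j to neuron i
    Lambda[i]     : external excitatory spike arrival rate at neuron i
    lam[i]        : external inhibitory spike arrival rate at neuron i
    Returns q[i], the steady-state excitation probability of each neuron.
    """
    r = W_plus.sum(axis=1) + W_minus.sum(axis=1)    # total firing rate of each neuron
    q = np.zeros(len(Lambda))
    for _ in range(iters):
        lam_plus = Lambda + q @ W_plus              # internal + external excitation
        lam_minus = lam + q @ W_minus               # internal + external inhibition
        q = np.minimum(lam_plus / (r + lam_minus), 1.0 - 1e-9)
    return q

# Toy 3-neuron network with arbitrary rates (illustrative values only)
W_plus = np.array([[0.0, 0.3, 0.2], [0.1, 0.0, 0.4], [0.2, 0.1, 0.0]])
W_minus = np.array([[0.0, 0.1, 0.1], [0.1, 0.0, 0.1], [0.1, 0.2, 0.0]])
print(rnn_steady_state(W_plus, W_minus,
                       Lambda=np.array([0.5, 0.2, 0.3]),
                       lam=np.array([0.1, 0.1, 0.1])))
```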

Gelenbe, E. et al. have investigated different search models [22-24]. In our own application of the RNN [25], our ISA acquires a query from the user and retrieves results from one or several search engines, assigning one neuron to each Web result dimension. Result relevance is calculated by applying our cost function, which divides a query into a multidimensional vector and weights its dimension terms with different relevance parameters. Our ISA adapts to and learns the perceived user’s interest and reorders the retrieved snippets based on our dimension relevant centre point. Our ISA learns result relevance in an iterative process in which the user directly evaluates the listed results. We evaluate and compare its performance against other search engines with a newly proposed quality definition, which combines both relevance and rank. We have also included two learning algorithms: Gradient Descent learns the centre of relevant dimensions and Reinforcement Learning updates the network weights by rewarding relevant dimensions and punishing irrelevant ones.
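The exact form of the cost function and of the learning updates is given in [25]; the sketch below only illustrates, under simplified assumptions, the general idea of reordering results by their weighted distance to a learned centre of relevant dimensions and of moving that centre towards the results the user marks as relevant. All dimension values, the step size and the number of selected results are hypothetical.

```python
import numpy as np

def score_results(result_vectors, centre, dim_weights):
    """Rank results by weighted closeness of their dimension vector to the
    learned centre of relevant dimensions (smaller distance = more relevant)."""
    diffs = (result_vectors - centre) * dim_weights
    return np.linalg.norm(diffs, axis=1)

def update_centre(centre, selected_vectors, step=0.5):
    """Gradient-descent-like move of the centre towards the mean of the
    results the user marked as relevant in this iteration."""
    return centre + step * (selected_vectors.mean(axis=0) - centre)

# Hypothetical 4-dimensional query with equal initial weights
dim_weights = np.ones(4)
centre = np.full(4, 0.5)
results = np.random.rand(10, 4)                       # one row per retrieved snippet
order = np.argsort(score_results(results, centre, dim_weights))
centre = update_centre(centre, results[order[:3]])    # user marks the top 3 as relevant
```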

4 Validation

4.1 Web Search Validation

Based on the result master list, we can affirm that the superior search engine has the higher density of better scoring results in top positions. In order to measure Web search quality numerically, or to establish a benchmark against which we can compare search performance, we propose the following algorithm, where results shown at top positions are rewarded and results shown at lower positions are penalized. We define quality, Q, as:

$$ \text{Q} = \sum_{\text{result} = 1}^{\text{Y}} \text{RML} \cdot \text{RSE} $$
(1)

where RML is the rank of the result in the master list, which represents the optimum result relevance order; RSE is the rank of the same result in a particular search engine; and Y is the number of results shown to the user. If a result’s position is larger than Y, we discard it from the calculation as it is considered irrelevant.

We define normalized quality, \( \overline{\text{Q}} \), as the quality Q divided by its optimum value, which is obtained when the results provided are ranked in the same order as in the master list; this value corresponds to the sum of the squares of the first Y integers:

$$ \overline{\text{Q}} = \frac{\text{Q}}{\frac{\text{Y}(\text{Y}+1)(2\text{Y}+1)}{6}} $$
(2)

where Y is the total number of results shown to the user.
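As an illustration, the following minimal sketch computes Q and \( \overline{\text{Q}} \) from Eqs. (1) and (2) for one ranked result list; representing the master list as a dictionary of scores is an assumption made only for this example.

```python
def web_search_quality(master_scores, engine_results, Y):
    """Compute Q (Eq. 1) and normalized quality (Eq. 2) for one search engine.

    master_scores  : dict mapping a result to its master-list score RML
                     (Y points for the top result, 1 for the last, 0 otherwise)
    engine_results : results returned by the engine, best first
    Y              : number of results shown to the user
    """
    Q = 0
    for position, result in enumerate(engine_results[:Y]):
        RSE = Y - position                    # engine score: Y for top, 1 for last
        RML = master_scores.get(result, 0)    # discarded (0) if outside the master top Y
        Q += RML * RSE
    Q_norm = Q / (Y * (Y + 1) * (2 * Y + 1) / 6)   # optimum: sum of the first Y squares
    return Q, Q_norm
```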

The Intelligent Internet Search Assistant we propose emulates how Web search engines work by using a very similar interface to introduce and display information. We validate our proposed ISA against current metasearch engines: we retrieve the results from the Web search engines they use, generate the result master list, and then compare the results provided by the metasearch engines against this master list. This method has the inconvenience that we do not consider any result obtained from Internet Web directories or online databases, from which the metasearch engines may have retrieved some of the results they display. We have selected Ixquick and Metacrawler as the metasearch engines against which we compare our ISA. After analysing the main characteristics of both metasearch engines, we consider that Metacrawler uses Google, Yahoo and Yandex and that Ixquick uses Google, Yahoo and Bing as their main sources of search results.

We have run our ISA on 10 different user queries based on the travel industry. The ISA retrieves the first 30 results from each of the main Web search engines programmed (Google, Yahoo, Bing and Yandex); we score 30 points for the Web site result displayed in the top position, 1 point for the Web site result shown in the last position and 0 points for any result that belongs to the same Web site and is shown more than once. After scoring the 120 results provided by the 4 Web search engines, we combine them by adding the scores of the results that share the same Web site and rank them to generate the result master list. We repeat this evaluation exercise for each query. We then retrieve the first 30 results from Metacrawler and Ixquick and benchmark them against the result master list using the proposed quality formula. The average quality values for the 10 queries are presented in Table 1.

Table 1. Web search validation
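A minimal sketch of the master list construction described above is shown below, assuming each engine’s results are represented as an ordered list of Web site identifiers; the scoring follows the 30-to-1 point scheme with duplicates from the same Web site scoring 0.

```python
from collections import defaultdict

def build_master_list(engine_result_lists, n=30):
    """Combine per-engine ranked result lists into a single master list.
    Each engine contributes n points to its top result and 1 to its last;
    repeated results from the same Web site receive 0 points."""
    scores = defaultdict(int)
    for results in engine_result_lists:        # one ranked list per search engine
        seen = set()
        for position, site in enumerate(results[:n]):
            if site in seen:
                continue                       # duplicate Web site scores 0
            seen.add(site)
            scores[site] += n - position       # n points for top, 1 for last
    # Master list: Web sites ordered by combined score, highest first
    return sorted(scores, key=scores.get, reverse=True)
```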

4.2 Online Academic Database Validation

In order to measure search quality, we can affirm that a better online academic database provides a list with more relevant results in top positions. We propose the following quality definition: within a list of N results, we score N for the first result and 1 for the last; the proposed quality is then the sum of the position scores of the results selected by the user. Our definition of quality, Q, is:

$$ \text{Q} = \sum_{i=1}^{\text{Y}} \text{RSE}_i $$
(3)

where \( \text{RSE}_i \) is the rank score of result i in a particular search engine, with a value of N if the result is in the first position and 1 if it is in the last; Y is the total number of results selected by the user. The best online academic database has the largest quality value. We define normalized quality, \( \overline{\text{Q}} \), as the quality Q divided by its optimum value, which is obtained when the user considers relevant all the results provided by the search engine; in this situation Y and N have the same value:

$$ \overline{\text{Q}} = \frac{\text{Q}}{\frac{\text{N}(\text{N}+1)}{2}} $$
(4)

We define I as the quality improvement of our Intelligent Search Assistant over an online academic database:

$$ \text{I} = \frac{\text{QW} - \text{QR}}{\text{QR}} $$
(5)

where I is the improvement, QW is the quality of the Intelligent Search Assistant and QR is the quality reference. We use the quality of Google Scholar, IEEE Xplore, CiteseerX or Microsoft Academic as QR in the first iteration of our validation exercise; in further iterations, we use the quality value of the previous iteration as the reference.
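The following sketch illustrates Eqs. (3)-(5) for an online academic database, assuming the user’s selections are given as 0-based positions within the list of N results.

```python
def academic_quality(selected_positions, N):
    """Q (Eq. 3) and normalized quality (Eq. 4) for one online academic database.

    selected_positions : 0-based positions of the results the user selected
    N                  : total number of results in the list
    """
    Q = sum(N - p for p in selected_positions)   # N points for the first result, 1 for the last
    Q_norm = Q / (N * (N + 1) / 2)               # optimum: every result selected
    return Q, Q_norm

def improvement(QW, QR):
    """Relative improvement I (Eq. 5) of the ISA quality QW over the reference QR."""
    return (QW - QR) / QR
```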

Our Intelligent Internet Search Assistant can select between the main online academic databases (Google Scholar, IEEE Xplore, CiteseerX or Microsoft Academic) and the type of learning to be implemented. Our ISA collects the first 50 results from the selected database, reorders them according to its cost function and finally shows the first 20 results to the user. The ISA reorders results while learning in a two-step iterative process, always showing only the best 20 results to the user. We have searched for 6 different queries and used the four online academic databases for each query, 24 searches in total. We have selected Gradient Descent for 3 queries (12 searches) and Reinforcement Learning for the other 3 queries (12 searches). Table 2 shows the average quality value of the database search engine and of the ISA. The first I represents the improvement of the ISA over the online academic databases; the second I is between ISA iterations 2 and 1; and the third I is between ISA iterations 3 and 2 (Table 2).

Table 2. Online academic database validation

4.3 Recommender System Validation

We have implemented our Intelligent Search Assistant to reorder the results from three different independent recommender systems: the GroupLens film database, Trip Advisor and Amazon. Our ISA reorders the films or products based on the updated result relevance, calculated by combining only the values of the relevant selected dimensions; the higher the value, the more relevant the film or product should be. The ISA shows the first 20 results, including their ranking, to the user. The user then selects the films or products with the highest ranking; this ranking has been previously calculated by averaging the user reviews of each product. We have included Gradient Descent and Reinforcement Learning for different queries in our validation. Experimental results are shown in Table 3.

Table 3. Recommender system validation
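A minimal sketch of this reordering step is shown below, assuming each film or product is represented by a vector of dimension values and that the indices of the relevant dimensions selected by the user are known; the item count and dimension count are hypothetical.

```python
import numpy as np

def reorder_by_relevant_dimensions(item_vectors, relevant_dims, top_k=20):
    """Score each film or product by the sum of its values over the dimensions
    the user marked as relevant and return the indices of the top_k items,
    most relevant first."""
    scores = item_vectors[:, relevant_dims].sum(axis=1)
    return np.argsort(-scores)[:top_k]

# Hypothetical example: 100 items described by 8 dimensions, 3 of them relevant
items = np.random.rand(100, 8)
print(reorder_by_relevant_dimensions(items, relevant_dims=[0, 3, 5]))
```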

5 Conclusions

We have proposed a novel approach to Web search and recommender systems within Big Data, in which the user iteratively trains the neural network while looking for relevant results. We have also defined a different process: the application of the Random Neural Network as a biologically inspired algorithm to measure both user relevance and result ranking based on a predetermined cost function. Our Intelligent Search Assistant generally performs slightly better than Google and other Web search engines; however, this evaluation may be biased because users tend to concentrate on the first results provided, which were the ones our algorithm showed. Our ISA adapts to and learns from the user’s previous relevance measurements, significantly increasing its quality and improvement within the first iteration. The Reinforcement Learning algorithm performs better than Gradient Descent: although Gradient Descent provides a better quality on the first iteration, Reinforcement Learning outperforms it on the second one due to its higher learning rate. Both show only residual learning on their third iteration. Gradient Descent would be the preferred learning algorithm if only one iteration is required; however, Reinforcement Learning would be a better option in the case of two iterations. Three iterations are not recommended because the learning is only residual.