Keywords

1 Introduction

The extraction of the feature vectors from the query-document is a critical step in learning to ranking. The main effort lies in mapping the query-document pair into a joint feature space which can precisely establish their relevance.

The common method is to extract the features for each of the input text pair and then rank them on the various similarity measures based on the lexical and semantic analysis. But the similarity defined in a different measure will lead to the different rank sequences. Surdeanu et al. [8] explore a wide range of classes of the features such as coarse word sense disambiguation, name-entity identification, syntactic parsing, and sematic role labeling. Although there is a large mount of text resource in the Web, but the labeled semantic resource for the supervised learning such as Penn Treebank is rare especially for the minority language, e.g. Chinese or Korean.

Deep learning especially the Convolutional Neural Networks (CNN) has been recently shown that it can efficiently learn to embed the sentences into a low dimensional vector space but preserve their syntactic and semantic relations in many NLP tasks [2, 10]. Severyn and Moschitti [7] build the CNN on the sentences pair in an end-to-end manner, where their model on the TREC question answering task outperforms the state-of-art systems without the manual feature engineering and the additional syntactic parsers.

In Baidu Cup 2016 Challenge, the competition invites the participants to tackle the problem of the Chinese entity search on four tasks including restaurants, movies, TV shows and celebrities. Given a query on the entity and a set of candidate entities, the ranking system should rank the entities with their relevance to the query. It is similar to the question answering, but with one characteristic: the answering only contains the entity name, which is too short to express its hiding meanings. For example, when querying “President of U.S.A.”, can we predict the relevance of the celebrity ‘Isaac Newton’ to the question without domain knowledge? It is impossible to achieve a good performance without the domain knowledge if we only focus on the lexical, the statistical, and the semantical feature information of the answer.

In this paper, we describe a novel ranking learning architecture for the ranking of the entity object to help us to achieve the champion of Baidu Cup 2016 challenges. The distinctive properties of our architecture are: (1) we crawl the domain knowledge automatically to extend the answers according to the different tasks; (2) the statistical relevance features are extracted to build ranking models, but we only use the simple frequency features since the syntax or the semantical resource on Chinese text is difficult to obtain; (3) we also define the new semantic relevance features by means of word2vec to handle the large scale of corpus, in contrast to the CNN [7] which is restricted to the short texts or sentences; (4) we adopt the simple point-wise method to take the relevance features as the input instead of directly computing their similarity.

We validate our architectures on the four tasks and analyze the effects of the components on the ranking performance, e.g. MAP or MRR. In the following, we describe the components of the ranking learning system and report our state-of-art experimental results. Finally, we conclude the paper and the outline the future work.

2 Framework of Learning to Rank Entity

The section briefly describes entity search problem, then discusses learning to rank, which will train a rank model to predict the order of the candidate answer.

2.1 Problem Formulation

The query is a set of keywords or key-phrases used by the users to express their desire. An entity is a thing with distinct and independent existence such as celebrity, restaurant, movie and tvShow in our tasks. The goal is to query the description targeting the entities, e.g. the review on some movie, the feeling in the restaurant environment.

Given an entity search query \(q_i \in Q\) and a set of the candidate entities \(E_i = {(d_1,r_1),(d_2,r_2),\cdots ,(d_{i_k},r_{i_k}),(d_{i_n},r_{i_n})}\), where \(d_{i_k}\) is the entity object and \(r_{i_k}\) is the relevant label equal to 1 if relevant to the query and 0 otherwise. The objective is to retrieve entities that is relevant to the query \(q_i\) from \(E_i\) under the ranking function is:

$$\begin{aligned} h(\varvec{w},\phi (q,D)) \rightarrow R, \end{aligned}$$
(1)

where \(\phi (q,D)\) is a query-dependent features depending both on the entity and the query and \(\varvec{w}\) is the parameter of the ranking function.

2.2 Extending Entity Extension by External Resources

We have recently seen a rapid and successful growth of Baidu BaikeFootnote 1, which is a largest open Chinese encyclopedia on the Web. It has now more than 13,000,000 word-items edited by approximate 6,000,000 free volunteers or professional personnel. The Baike aims to be a Chinese encyclopedia and the articles on the Baike is refer to all aspects of the Chinese culture. We extract the knowledge for each entity object from celebrity, movie and tvShow. Baike will be much easier than from raw texts or from usual Web texts because of its structure. The objectiveness of the Baike also help us to rank the entity according to the query precisely. In fact, many natural language processing studies try to exploit Wikipedia as a knowledge source [1, 9] for the English language.

The restaurant tasks is an exception since there is no needs for each restaurant for the Baike which aims to provide the authoritative resources or knowledge. So we crawled the review pages from Dazhong reviewFootnote 2, which is the largest city life website guiding the mass consumption in China.

Finally, we also collected the information and the user reviews for the movie and the tvShow tasks from Douban websiteFootnote 3, which provides the information and reviews on the books, the movies (including tvShow) and the music generated by over 2 billions Chinese users.

2.3 Feature for Learning

In this section, we introduce the features extracted in our experiments which can be directly used by the learning algorithm. Each row of the matrix corresponding to the feature file represents a query-entity pair. The other files for each query-entity pair separately records the query id, the entity id, the answer id in the query and the label shows the entity is relevant to the query or not.

Table 1. Statistical features on corpuses: the words streams come from the title, the body and the title + the body and it can be segmented phrase or 2-ngram words

Word Features. In the following, We give some details of these features, the fore part of which is referenced to the dataset LETOR 3.0 [6].

In the corpus, we considered three types of streams: title, body and title + body. For the entity body, we cut the whole part into the sentences by the end mark of the Chinese language and the English language for convenience. We also removed all Chinese and English punctuation character (Table 1).

Word2vec [3] is a series of models used to produce the word embeddings, where the models are a two-layer neural networks that take as the input a large corpus of text and produce a corresponding vector for each unique word in the high-dimensional space. Although word2vec plays a part just for computing the similarity between the words, we have no knowledge about the computation of the sentence similarity. In the following, we first give some heuristic approaches to compute the similarity between any sentence based on the public available word2vecFootnote 4.

The similarity between the query word \(q_i\) and the sentence s is defined as the max value among the similarity between \(q_i\) and the word \(s_i\) in the sentence s

$$\begin{aligned} sim(q_i,s) = \max _{s_j \in s} q_i^T s_j, \end{aligned}$$
(2)

where \(q_i\) and \(s_i\) is a normalized vector with the unit length and both of them are extracted on the trained model from all types of entity corpus by word2vec.

We arrange all \(q_i \in q(i = 1,2,\cdots ,m)\) into the matrix \(Q = [q_1,q_2,\cdots ,q_m]^T\) and \(s_j \in s(j = 1,2,\cdots ,n)\) into the matrix \(S = [s_1,s_2,\cdots ,s_n]^T\), then

$$\begin{aligned} R = QS^T \end{aligned}$$
(3)

where \(R = [r_1,r_2,\cdots ,r_m]^T\) and \(r_i = q_i^T s\). It is clear that

$$\begin{aligned} sim(q_i,s) = \Vert r_i\Vert _{\infty } \end{aligned}$$
(4)

The similarity computation of the query q and the sentence s is related to \(sim(q_i,s)\) for all \(q_i \in q\). With the sum and max operation, we can define the following four features including Sum of Similarity (SS), Sum of Weighted Similarity (SWS), Max of Similarity (MS) and Max of Weighted Similarity (MWS)

$$\begin{aligned} SS(q,s)= & {} \sum _{q_i \in q} sim(q_i,s) \end{aligned}$$
(5)
$$\begin{aligned} SWS(q,s)= & {} \sum _{q_i \in q} sim(q_i,s)*idf(q_i) \end{aligned}$$
(6)
$$\begin{aligned} MS(q,s)= & {} \max _{q_i \in q} sim(q_i,s) \end{aligned}$$
(7)
$$\begin{aligned} MWS(q,s)= & {} \max _{q_i \in q} sim(q_i,s)*idf(q_i) \end{aligned}$$
(8)

2.4 Word Segmentation and 2-Gram Words

Chinese Word segmentation is the problem of dividing a string into its component words. It is a critical step for Chinese language processing. But the performance of the algorithm depends the domain specific dict and the used corpus. The most word segmentation algorithm does not handle the ambiguous words and unregistered ones. In our work, we adopt the simple 2-gram representation to complement the deficiencies of the word segmentation considering most of Chinese phrases consist of two words.

2.5 Ranking Model Design

In our work, we chose the ensemble approaches to predict the similarity probability for each query-entity pair. Our experiment took three candidates including AdaBoost, Random Forest and ExtraTree Classifier [5]. Further, we also tried to fuse the posterior probabilities and the rankings from multiple classifiers although the improvement is subtle on the celebrity and the restaurant datasets.

3 Experiments

3.1 DataSet Description

Baidu Challenge 2016 includes four datasets including movies, tvShows, restaurants and celebrities. For each type, there are 100 entity queries for the training and 1,000 ones for the testing. Before the competition ends, 40% of the test queries were used as the development set for all participants. In our experiments, we do not consider the results on the development set. We extracted the feature vectors from all query-entity pairs, where Table 2 gives the detailed information.

Table 2. DataSet Information

3.2 Experiments Design and Results

Data Retrieval. We crawled Baike, Douban, Dazhong web sites by the Baidu crawler and Yahoo crawler. It is important to validate the correctness of the crawled pages. We saved the meta information in the front of the texts to check whether it is consistent with the corresponding entity or not.

Preprocessing. We preprocessed the Chinese texts by removing all Chinese punctuations after splitting the total texts into the sentences with the Chinese punctuations as the end mark. Further, the sentences were split into the single Chinese words by the Jieba open sourceFootnote 5 and by sliding on the text with the two-width window (2-gram), separately.

Word Embeddings. We initialized the word embeddings by running word2vec tool [4] on the Chinese corpus. The tvshow and movie tasks used the corpus contain roughly 2 million vocabularies, 1 million ones for the celebrity tasks and 0.8 million ones. To train the embeddings we used the continuous of bag words model with the window size 5 to generate a 50-dimensional vectors for each word. The embeddings vector not present in the word2vec model were randomly initialized with the equal length vector with each component taken from the uniform distribution \(U[-0.25,0.25]\). On both of word segmentation form and 2-ngram form, we extracted 33 features from each corpus and merged them into 66-dimensional features.

Experiment Setup. We evaluated the performance on the training dataset by 10-fold cross validation (cv-10). In particular, the query-entity pair from the same query would be put into the same fold for keeping the completeness of each query. The cross folds were kept invariant for all parameters settings.

In implementing the extra tree classifier, the parameter \(n\_estimator\) were randomly drawn from the integer range [100, 500] and another parameter \(max\_depth\) was randomly from the enumeration range \(\{4,6,8,10,12\}\). The models parameters were optimized in the space of the grid with the parameter \(n\_estimator\) and \(max\_depth\) by the cross-validation on the training data.

Evaluation Measures. The competition uses the Mean Average Precision (MAP) to evaluate the quality of the submission file, which is common in the information retrieval. MAP examines the ranks of all the related entities and computes the mean over the average precision scores for each query

$$\begin{aligned} MAP(q) = \frac{1}{|q|} \sum _{q_i \in q} avgprec(q_i). \end{aligned}$$
(9)

Meanwhile, we computes \(avgprec(q_i)\) as follows

$$\begin{aligned} avgprec(q_i) = \sum _{e \in q_i}\frac{rel(e,q_i)}{pos(e)} \end{aligned}$$
(10)

where \(rel(e,q_i)\) (1 or 0) shows whether the entity e is correlated with the query q and pos(e) is the position of the entity e in the ranking sequence of the query q.

Finally, we achieve the first place of the Baidu Challenge 2016 competition as shown in the Table 3. Limited to the short readiness time, more experiments and analysis are ongoing in order to keep the integrity of the entire paper. In the further work, we will promptly give more detailed comparisons and comprehensive experimental analysis.

Table 3. Competition results on four tasks

4 Conclusion

In this paper, we explore the merging of the simple word frequency features and the word2vec-based sematic features to solve the entity search problem. The effectiveness of the features is shown on the Baidu Challenge 2016 competitions datasets. We explain the entire processing of the experiments. We achieved the best performance with the merging of the features and the extra tree ranker (point-wise ranker). In future work, we will improve and finish the experiment comparisons, furthermore, we will take these features as the preprocessing step of the deep learning and apply the CNN to learn the relation between the query and the entity.