1 Introduction

Social media have become important sources of information in disaster situations (Imran et al. 2015; Varga et al. 2013). They provide information both to victims of a disaster and to providers of relief and resources. Information Retrieval (IR) techniques can provide a means for effectively and efficiently sharing this information within and between these two groups. Recognizing the importance of developing robust IR tools for locating relevant content in social media sources, IR benchmark tasks, specifically the FIRE 2016 Microblog track (Ghosh and Ghosh 2016) and SMERP 2017 (Ghosh et al. 2017) tasks, have been organized to provide evaluation exercises for testing such tools and techniques.

Traditional approaches to IR typically rely on word-matching based retrieval methods like BM25 (Robertson et al. 1994) and Language Modeling (LM) (Hiemstra 2000; Ponte and Croft 1998). These approaches perform well only when queries and documents use the same vocabulary and both are sufficiently detailed. However, such models are likely to perform poorly if relevant documents lack the important terms present in the query. This problem can be particularly acute for microblog posts, since their brevity (at most 140 characters) means that they may not contain many of the necessary keywords. Consequently, these models may fail to return relevant tweets at the top ranks when searching a tweet collection. The widespread occurrence of spelling variations, abbreviations and code-mixing in social media makes IR from tweets even more difficult.

The vocabulary mismatch problem may be alleviated by semantic matching techniques, which match the semantic intent of the query against documents without requiring that all query terms appear in a document. With this hypothesis in mind, we investigate the use of word embeddings for tweet retrieval in this paper. We use word embeddings generated by the popular Word2vec (Mikolov et al. 2013) tool, and compare conventional word matching models (LM and BM25) with document embedding based retrieval. We find that the latter produces better results than the former, mainly because the semantic matching techniques are able to bridge the vocabulary mismatch between queries and documents. This finding is consistent with those of earlier studies that explored the use of word embedding methods on non-disaster datasets (Diaz et al. 2016).

The remainder of the paper is organized as follows. We discuss related work in Section 2. We then introduce our proposed approach in Section 3. Experimental results are presented in Section 4. We conclude by sketching some directions for further work in Section 5.

2 Related Work

Early work on microblog retrieval includes a language modeling approach for searching microblog posts proposed by Massoudi et al. (2011). Their method incorporated query expansion and used certain “quality indicators” during matching. Hashtag retrieval (Efron 2010) is also closely related to our work. Hashtags refer to certain important “keywords” in a message that are designated as tags using a hash (#) sign. Hashtags serve as a very quick way to categorize or tag messages. Efron (2010) showed that for a Twitter collection, hashtags can be predicted using query expansion. Dong et al. (2010) proposed a ranking method that takes both “relevance” and freshness into account. Del Corso et al. (2005) also suggested a ranking method for news documents in which recency plays a major part. Bandyopadhyay et al. (2012) studied the use of external resources for query expansion during tweet retrieval.

Word embeddings map words in a text collection to relatively low-dimensional real-valued vectors. These vectors are intended to capture the semantics (or, more precisely, the contexts or usage patterns) of words within the text. Word2vec (Mikolov et al. 2013a, b) is an efficient word embedding method that has received much attention in the last few years; we use this method in our proposed model. The idea was later extended to sentences and documents (Le and Mikolov 2014). Lau et al. (2016) present an empirical evaluation of these ideas, along with some practical insights into document embedding generation. Ganesh et al. (2016) describe a two-phase document embedding method. Recently, Kim et al. (2017) proposed the use of word embeddings to represent a document's concepts. They create clusters of semantically similar words from documents, following which a document is represented as a bag-of-concepts rather than the traditional bag-of-words. Xing et al. (2014) use word embeddings within a document classification method, representing a document as a Gaussian Mixture Model estimated from the constituent word vectors.

3 Our Approach

As mentioned in the Introduction, our objective is to study whether word embeddings may be used to improve retrieval effectiveness over traditional IR models.

Proposed Model

We propose the use of document embeddings to represent each document as a vector. Our model is a straightforward extension of word2vec (Mikolov et al. 2013b) to the document level. Suppose document d contains the set of words Wd; then the vector representation of d is given by

$$ \overrightarrow{d}=\sum\limits_{i = 1}^{|{W_{d}}|} \overrightarrow{w_{id}} $$
(1)

where \(w_{id} \in W_{d}\) is the ith distinct word of document d, and \(\overrightarrow {w_{id}}\) is the word2vec embedding of wid. The same representation is used for each query, say q, which can similarly be viewed as a set of words. Let \(\overrightarrow {q}\) be the embedding vector of the query q. Then the similarity between q and d may be calculated by the following scoring function:

$$ Sim(q, d) = CosSim(\overrightarrow{q}, \overrightarrow{d}) = \frac {\overrightarrow{q}\cdot \overrightarrow{d}}{||\overrightarrow{q}|| \cdot ||\overrightarrow{d}||} $$
(2)

where CosSim is the cosine similarity between the vectors, and \(\overrightarrow {q}\cdot \overrightarrow {d}\) represents the inner-product between \(\overrightarrow {q}\) and \(\overrightarrow {d}\).
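To make the scoring concrete, the following is a minimal sketch of Eqs. 1 and 2 in Python, assuming a pre-trained gensim word2vec KeyedVectors object (w2v) and tokenized tweets and queries; the function names are illustrative, not part of our actual implementation.

```python
# Minimal sketch of the proposed scoring function (Eqs. 1 and 2),
# assuming a pre-trained gensim KeyedVectors object `w2v`.
import numpy as np

def embed(tokens, w2v):
    """Represent a tweet/query as the sum of its word2vec vectors (Eq. 1)."""
    vecs = [w2v[t] for t in set(tokens) if t in w2v]  # distinct in-vocabulary words
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.sum(vecs, axis=0)

def sim(query_tokens, doc_tokens, w2v):
    """Cosine similarity between query and document embeddings (Eq. 2)."""
    q, d = embed(query_tokens, w2v), embed(doc_tokens, w2v)
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(q @ d) / denom if denom > 0 else 0.0
```

Documents are then ranked for a query by sorting them in decreasing order of this similarity score.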

For the word embeddings used above, we tried both models commonly used with word2vec:

  • V1: ‘Continuous bag of words’ (Mikolov et al. 2013a), and

  • V2: ‘Skip gram’ (Mikolov et al. 2013b).
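For illustration, both variants can be trained with the gensim implementation of word2vec as sketched below; the toy corpus and hyper-parameter values are placeholders, not the settings used in our experiments.

```python
from gensim.models import Word2Vec

# Toy corpus standing in for the pre-processed tweet collection.
sentences = [["earthquake", "relief", "resources", "water"],
             ["free", "wifi", "sms", "calling", "facility"]]

# V1: continuous bag of words (sg=0); V2: skip-gram (sg=1).
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

word_vectors = cbow.wv  # KeyedVectors usable with the scoring sketch above
```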

3.1 Baselines

Next, we briefly review the baseline methods against which our proposed model is compared.

3.1.1 Language Model

The general idea of language modeling based retrieval can be described as follows. Let Q be a query and d be a document. Let \(\mathcal {D}\) represent the language model estimated from d. Then the score of document d with respect to query Q is given by \(p(Q|\mathcal {D})\), the probability of generating Q from \(\mathcal {D}\). The language model \(\mathcal {D}\) associated with the document d is usually approximated by a unigram language model, i.e., a probability distribution over single words. Assuming that query terms occur independently of one another, the retrieval score of d for a given query Q can be written as

$$\begin{array}{@{}rcl@{}} \text{Score}(Q,d) & = & p(Q|\mathcal{D})\\ & = & \prod\limits_{q\in Q}p(q\lvert d) \end{array} $$
(3)

3.1.2 Jelinek-Mercer Smoothing (LM-JM) (Jelinek and Mercer 1980)

To overcome the zero probability problem (when a query term is missing from document d, the score of d becomes zero according to Eq. 3), the language model \(\mathcal {D}\) is smoothed by interpolating the Maximum Likelihood Estimate (MLE) of p(q|d) with a background language model estimated from the entire collection C, as in Eq. 4.

$$\begin{array}{@{}rcl@{}} p(Q|\mathcal{D}) & = & \prod\limits_{q\in Q}\left[(1-\lambda) p(q \lvert d) + \lambda p(q \lvert C)\right]\\ & = & \prod\limits_{q\in Q} \left[ (1-\lambda) \frac{\mathit{tf}(q,d)}{\lvert d\lvert} + \lambda \frac{\mathit{cf}(q)}{\lvert C\lvert} \right] \end{array} $$
(4)

Here, tf(q,d) and cf(q) denote the number of occurrences of q in the document d and in the whole document collection C, respectively; |d| and |C| represent the total number of words in the document and in the collection, respectively; λ ∈ [0, 1] is the interpolation parameter. This language modeling based retrieval model is known as the language model with Jelinek-Mercer (or linear) smoothing, and is referred to as JM or LM-JM in the rest of this article.
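A minimal log-space rendering of Eq. 4 is sketched below; the interface (pre-computed collection term frequencies and a default λ) is an illustrative assumption rather than our exact setup.

```python
import math
from collections import Counter

def lm_jm_score(query, doc, collection_tf, collection_len, lam=0.5):
    """Jelinek-Mercer smoothed language model score (Eq. 4), in log space.
    `query` and `doc` are token lists; `collection_tf` maps a term to its
    collection frequency cf(q); `collection_len` is |C|."""
    tf = Counter(doc)
    score = 0.0
    for q in query:
        p_doc = tf[q] / len(doc) if doc else 0.0
        p_coll = collection_tf.get(q, 0) / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        score += math.log(p) if p > 0 else float("-inf")
    return score
```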

3.1.3 Dirichlet Smoothing (LM-D) (MacKay and Peto 1994)

Another widely used smoothing method is Dirichlet prior smoothing. Its mathematical form is similar to that of LM-JM (Eq. 4), except that the interpolation parameter is a function of document length instead of a constant; μ > 0 is the smoothing parameter (see Eq. 5).

$$ P(Q\lvert \mathcal{D}) = \prod\limits_{q\in Q} \frac{\mathit{tf}(q,d) + \mu p(q\lvert C)}{\lvert d\lvert + \mu} $$
(5)
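An analogous sketch for Eq. 5 follows; again, the interface and the default value of μ are illustrative assumptions.

```python
import math
from collections import Counter

def lm_dirichlet_score(query, doc, collection_tf, collection_len, mu=1000.0):
    """Dirichlet-smoothed language model score (Eq. 5), in log space."""
    tf = Counter(doc)
    score = 0.0
    for q in query:
        p_coll = collection_tf.get(q, 0) / collection_len
        p = (tf[q] + mu * p_coll) / (len(doc) + mu)
        score += math.log(p) if p > 0 else float("-inf")
    return score
```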

3.1.4 Okapi BM25

The Okapi BM25 (Robertson et al. 1994) ranking function uses term frequencies and inverse document frequencies as follows:

$$ \text{Score}(Q,d) \,=\, \sum\limits_{q\in Q} \log\frac{N\,-\,\mathit{df}(q) \,+\, 0.5}{\mathit{df}(q) \,+\, 0.5} \frac{\mathit{tf}(q,d)} {k_{1} ((1\,-\,b) \,+\, b\frac{\mathit{dl}}{\mathit{avdl}}) \,+\, \mathit{tf}(q,d)} $$
(6)

where N is the total number of documents in the collection, df(q) is the number of documents containing the term q, dl and avdl are respectively the length of document d and the average length of documents in the collection, and b, k1 are parameters.
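A straightforward rendering of Eq. 6 is given below; the document-frequency table passed in and the default parameter values are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_score(query, doc, df, N, avdl, k1=1.2, b=0.75):
    """Okapi BM25 score of token list `doc` for token list `query` (Eq. 6).
    `df` maps a term to its document frequency; `N` is the collection size;
    `avdl` is the average document length; k1 and b are free parameters."""
    tf = Counter(doc)
    dl = len(doc)
    score = 0.0
    for q in query:
        if q not in df or tf[q] == 0:
            continue
        idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5))
        norm = k1 * ((1 - b) + b * dl / avdl) + tf[q]
        score += idf * tf[q] / norm
    return score
```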

3.1.5 Doc2vec (DV)

Doc2vec (DV) is a document embedding method proposed by Le and Mikolov (2014) that learns a fixed-length vector for each document jointly with word vectors.
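A minimal DV sketch using gensim's Doc2Vec implementation is shown below; the toy corpus and hyper-parameters are placeholders, not the settings used in our experiments.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus standing in for the pre-processed tweet collection.
docs = [TaggedDocument(words=["earthquake", "relief", "water"], tags=[0]),
        TaggedDocument(words=["free", "wifi", "sms"], tags=[1])]

model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, epochs=20)

# Embed a query and rank documents by cosine similarity to it.
query_vec = model.infer_vector(["relief", "resources"])
ranking = model.dv.most_similar([query_vec], topn=2)
```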

3.1.6 Word Mover’s Distance (WMD)

Kusner et al. (2015) proposed the Word Mover’s Distance (WMD) to measure the dissimilarity between two documents d and d′. To compute WMD, each word in d and d′ is first represented by its embedding. Then, each word in d is “moved”, wholly or in parts, to words in d′. The cost of moving one word to another is measured by the Euclidean distance between their embeddings. Roughly, the minimum overall cost of moving all words in d to words in d′ gives the WMD between d and d′.
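For reference, gensim exposes WMD through KeyedVectors.wmdistance (which requires an optimal-transport backend such as pyemd or POT, depending on the gensim version); the snippet below is an illustrative sketch with a toy corpus, not our exact setup.

```python
from gensim.models import Word2Vec

# Toy corpus standing in for the pre-processed tweet collection.
sentences = [["earthquake", "relief", "resources", "water"],
             ["free", "wifi", "sms", "calling", "facility"]]
wv = Word2Vec(sentences, vector_size=50, min_count=1).wv

# Lower distance means more similar; a retrieval score can use the negated distance.
dist = wv.wmdistance(["earthquake", "relief"], ["water", "resources"])
```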

4 Experiments and Results

4.1 Datasets

For our experiments, we used the data provided by the First International Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) held at ECIR 2017. The SMERP collection consists of 372,220 tweets / microblogs posted during the August 2016 earthquake in central Italy, collected over three days. The dataset also contains four queries and their relevance assessments. Each SMERP query has four fields:

  • the query number tagged with <num> and </num>;

  • a title (tagged using <title> and </title>), which is a very brief representation of the information need;

  • a description (tagged with <desc> and </desc>), which is a somewhat more detailed specification of the information need; and

  • a narrative (tagged with <narr> and </narr>), a thorough specification of the information need that spells out what is and is not relevant.

All four SMERP queries are provided in the Appendix.

4.2 Experimental Setup

4.2.1 Pre-processing

Tweets and queries were pre-processed before indexing. The pre-processing steps are as follows (a minimal code sketch is given after the list):

  1. URLs were removed.

  2. Any token containing only punctuation was removed.

  3. Any token containing only digits was removed.

  4. Stopwords were removed.

  5. Finally, the remaining tokens were stemmed using Porter’s stemmer (Porter 1997).
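The following sketch illustrates this pipeline; it assumes NLTK's English stopword list and Porter stemmer, and simple whitespace tokenization, which may differ in detail from our actual implementation.

```python
import re
import string
from nltk.corpus import stopwords        # requires the NLTK stopwords corpus
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(tweet):
    """Apply the five pre-processing steps listed above to one tweet."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", tweet)           # 1. remove URLs
    tokens = text.split()
    tokens = [t for t in tokens
              if not all(c in string.punctuation for c in t)      # 2. punctuation-only tokens
              and not t.isdigit()                                 # 3. digit-only tokens
              and t.lower() not in STOPWORDS]                     # 4. stopwords
    return [stemmer.stem(t.lower()) for t in tokens]              # 5. Porter stemming
```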

After pre-processing, the SMERP collection contains 303,867 distinct words / tokens.

Table 1 Performance of various methods
Table 2 Relevant tweets retrieved by semantic matching

4.2.2 Baseline Methods

Parameters for the baseline methods (LM-JM, LM-D and BM25) were tuned on the test dataset itself. The reported performance of these baselines may thus be regarded as somewhat unrealistically high. Details of parameter tuning for these methods are given in Tables 3, 4 and 5.

Table 3 Language Model Jelinek-Mercer Smoothing (LM-JM) parameter λ tuning using only SMERP data
Table 4 Language Model Dirichlet Smoothing (LM-D) parameter μ tuning using only SMERP data
Table 5 Okapi BM25 (BM25) parameter k1 and b tuning using only SMERP data

4.2.3 Generating Word and Document Embeddings

We used a variety of corpora to generate word embeddings. We use the following labels for different variants of our basic approach (Section 3) that use embeddings generated from different sources / models.

  • WV: uses embeddings generated from the SMERP collection using the continuous-bag-of-words model.

  • WVSG: uses embeddings generated from the SMERP collection using the skip-gram model.

  • WV + MB: uses continuous-bag-of-words embeddings generated from a combined corpus consisting of both the SMERP collection and the TREC 2011 Microblog Track collection (Ounis et al. 2011).

  • MB: uses continuous-bag-of-words embeddings generated from only the TREC 2011 Microblog Track collection.

Table 1 compares the performance of the different methods. The table shows that WV outperforms the traditional IR methods as well as WV + MB, MB, WVSG, DV and WMD on all the evaluation measures. WV, the best method, achieved a MAP of 0.0672, which is about 29% better than the second-best method (LM-JM). In terms of GMAP, the performance of WV is almost 32% better than that of the second-best method (LM-JM). WV also produces the best R-Prec, and the best P@k for almost all the reported values of k. Once again, these results are considerably better than most of the baselines.

Since WV outperforms WVSG by a wide margin, we did not generate skip-gram embeddings from the TREC 2011 Microblog Track collection or the merged collection.

From Table 1, we also see that word embeddings do not help if they are generated using tweets not related to the disaster period (MB, WV + MB). To be precise, the use of embeddings generated from MB or even WV + MB degrades the overall performance.

The superiority of the proposed method (variant WV) may be explained as follows. WV retrieves some relevant tweets that, besides the query terms, contain useful terms semantically related to the information need. These tweets were retrieved within the top 50 ranks by WV, but the other methods were unable to retrieve them even in the top 1000 ranks. For example, query SMERP-T1 seeks information about available resources; thus, information about services like “free wifi”, “sms”, “calling facility” etc. is relevant. A relevant tweet (tweet id 768418420060745728) was retrieved at rank 37 by WV, but this tweet was not retrieved by the other models. This tweet contains terms such as “disaster”, “earthquake”, and “monitoring” that are important and semantically related to the query, although they are not query keywords. WV thus seems to be able to find semantically relevant tweets which contain important terms that are absent from the queries. Some more examples of a similar kind are given in Table 2.

5 Conclusion

Our proposed model WV comprehensively outperforms traditional information retrieval methods such as LM-JM and BM25, as well as other document embedding approaches such as DV and WMD. However, the absolute values of all metrics, even for the best performing method (WV), are very low. In the near future, we intend to investigate the reasons for this poor performance. We also need to study why the skip-gram model (used in WVSG) performed worse than the continuous-bag-of-words model (used in WV) for generating word embeddings. Finally, the poor performance of DV and WMD also needs to be investigated.

Our experiments suggest that embeddings generated using tweets from outside the disaster time period hurt rather than improve performance. It is possible that better word embeddings may be learnt if more tweets can be collected from the disaster period. We hope to explore this hypothesis in future work.