1 Introduction

With the ever-widening spread of computers, communications and the Internet, social media is becoming increasingly important as a means of social interaction. In particular, social media has quickly established itself as an important means that people, NGOs and governments use to spread information during natural or man-made disasters, mass emergencies and crisis situations. In these scenarios, social media is used to report first-hand (“ground zero”) experiences, including photos, videos and near-hand observations, contact relatives and friends, request help, disseminate information about available help and other services, organize local search and rescue operations, monitor the situation, and report status, damages or losses. Given this important role, real-time analysis of social media content to locate, organize and use valuable information for disaster management is an active research area; see Imran et al. (2015) for a comprehensive survey.

In this paper, we propose a weakly supervised learning algorithm to automatically detect social media content having information about disasters. We include natural disasters such as earthquakes, floods, hurricanes, famines, forest fires, volcanic eruptions, tsunamis, landslides and disease epidemics (e.g., H1N1, swine flu, Ebola), as well as man-made disasters such as nuclear power plant accidents. We exclude crimes, accidents, insurgencies, terrorist attacks and war.

Factual information about various kinds of disasters, as expressed in, say, news stories, is often similar at a broad level: it mostly covers aspects such as damage, injuries and deaths brought about by a disaster, as well as search, rescue, relief, recovery and rehabilitation. We generalize this informal observation to a natural principle, which might be called the principle of information correspondence: in a given type of documents, similar events are expressed similarly. The similarity of expression of information is essentially at the level of semantics. Clearly, news and tweets are two distinct types of documents. Thus we may expect that different news items will broadly express information about disasters in a similar manner, and different tweets will broadly express information about disasters in a similar manner, although information expression across news and tweets need not be similar. Example text from news and tweets related to two types of disasters (earthquake and flood) is shown in Table 1. Crimes, accidents, acts of terrorism, and even weather reports, may sometimes contain information similar to that of a disaster. Thus any technique for automatically detecting disaster-related information should reject such confounding pieces of information.

Table 1 Some example news and tweets related to earthquakes and floods

In this paper, our main goal is to build a model, with minimal supervision, that can be used to quickly classify tweets in an incoming stream as DISASTER-RELATED (+ 1) or NOT-DISASTER-RELATED (− 1). We require the model to be simple, human-understandable (and even human-modifiable) and usable in a real-time scenario. A reason for this requirement is that the users of such a model (i.e., users who read and analyze disaster related tweets for further actions) find it easier to deal with word-based models in order to act quickly and clearly. Our word-based models have the advantage that these users can dynamically change and update them without too much effort, in order to align the models with drifts in the nature and content of tweets.

Toward this end, we propose self-learning algorithms that, with minimal supervision, construct models of information expressed in news about various natural disasters. The constructed model is extremely simple and basically consists of a set (bag) of characteristic words that are often present in information expressed about disasters. The algorithm constructs a single model for all disasters, i.e., it does not differentiate among different classes of disasters, although our approach can easily be used to learn a model for a specific disaster type.

Since tweets are a different type of documents than news, we next propose a model transfer algorithm, which essentially refines the model learned from news by analyzing a large unlabeled corpus of tweets. We show empirically that model transfer improves the predictive accuracy of the model, and that our model learning algorithm is better than several state-of-the-art semi-supervised learning algorithms. Once the word model is deployed to classify a stream of tweets, we often find that the model needs to be adjusted to handle drifts in the vocabulary used to report disasters. As a first step toward dynamically adjusting the word model for tweet classification, we propose an online algorithm that learns and automatically adjusts a weight for each word in the initial word model.

The paper is organized as follows. Section 2 contains related work, Section 3 contains learning algorithms, Section 4 contains baselines, Section 5 contains experiments, Section 6 presents an online algorithm that learns weights for words in a model, and Section 7 outlines conclusions and further work.

2 Related Work

In this section, we summarize the work related to disaster detection, focusing primarily on semi-supervised learning and transfer learning. Imran et al. (2015) surveyed computational methods that process social media messages and map them to problems such as event detection and the creation of actionable, useful summaries. Zhao et al. (2007) used textual, social and temporal characteristics to detect events in social streams. Social text streams are represented as multi-graphs with social actors as nodes and information flows as edges; events are detected by combining text-based clustering, temporal segmentation, and information flow-based graph cuts of the dual graph of the social network. Sakaki et al. (2010) built an earthquake reporting system based on event detection. They used a classifier based on the keywords in a tweet, the number of words in the tweet, and its context to detect earthquake-like events. Once an event is identified, a probabilistic spatio-temporal model is built for finding the center and trajectory of the event. LITMUS (Musaev et al. 2014) is a landslide detection system which integrates USGS seismic data and the NASA TRMM rainfall network with Twitter, Instagram and YouTube. Social media data is filtered using keyword-based filtering, geo-tagging and classification, and a relevance score is computed to detect landslides.

Ritter et al. (2015) cast seed-based event extraction as a weakly supervised learning problem where only positive and unlabeled data is available. They regularize the label distribution over unlabeled examples toward a user-specified expectation of the label distribution for the keyword. Zhou et al. (2012) presented a self-training algorithm that decreases the disagreement region of hypotheses. The algorithm supplements the training set with self-labeled instances: the instances that most reduce the disagreement region of hypotheses are labeled and added to the training set. Yang et al. (2009) developed a technique to identify the evolution of relationships between news events within the same topic. Their event evolution identification technique automatically identifies event evolution relationships and represents them as an Event Evolution Graph (EEG), which helps to understand how events evolve along the timeline.

Our work is similar in spirit to query expansion, where the idea is to add suitable words to the user’s initial query so as to improve the retrieval results. We do not survey those works here, except for Zhao et al. (2014), who present a query expansion algorithm, which they use to create a tweet graph, together with an anomaly detection method on this graph specifically for detecting civil unrest related tweets.

There is a large body of work in semi-supervised classification, which uses a large amount of unlabeled data and a small amount of labeled data to build better classifiers. Nigam et al. (2000) introduced an algorithm for learning from labeled and unlabeled text documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. A classifier is trained using the labeled documents and labels are predicted probabilistically for the unlabeled documents; a new classifier is then trained using the labels for all the documents, and this process iterates to convergence. We use a simplified version of this approach as a baseline. Davidov et al. (2010) utilized the semi-supervised sarcasm identification algorithm of Tsur et al. (2010) and proposed the SASI algorithm, which successfully captures sarcastic sentences on Twitter and in other domains. The algorithm employs two modules: semi-supervised pattern acquisition, which identifies sarcastic patterns that serve as features for a classifier, and a classification stage that assigns each sentence to a sarcasm class.

Transfer learning involves leveraging knowledge learnt from a source domain/task to improve learning on a target domain/task. Dai et al. (2007) proposed a transfer-learning algorithm for text classification based on an EM-based naive Bayes classifier. First, initial probabilities are estimated under the distribution of the labeled data set, and then an EM algorithm revises the model for the different distribution of the unlabeled test data. This approach has been used by Zhao et al. (2013) for crowd-selection on Twitter. Guerra et al. (2011) analyze sentiments using opinion holder bias prediction. First, the bias of a social media user toward a specific topic is measured by solving a relational learning task over a network of users connected by endorsements. Sentiments are then analyzed by transferring user biases to textual features. They show that even when a topic changes its profile, as new terms arise and old terms change their meaning, the user bias helps in building more accurate classification models because it is consistent over time.

Roy et al. (2012) proposed SocialTransfer, a cross-domain real-time transfer learning framework. SocialTransfer uses a topic space learned in real time via online streaming Latent Dirichlet Allocation and a transfer learning method based on real-time cross-domain graph spectra analysis. The transfer graph captures the relationships between videos and topics but cannot update itself due to data constraints; the relationships between social streams and topics are instead updated using the streaming social media data to reflect changes in the topic profile. The updated transfer graph is then used for video recommendation and query suggestion in video search.

Online learning for classification is fast emerging as a new and practically useful setting. Many online learning algorithms for classification assume an ensemble (or committee of experts) setting (Mohri et al. 2012). In this paper, we follow and modify the classic perceptron algorithm for online learning, although our classification model is not a linear model.

Each tweet is a very short and rather noisy document. Computing similarity (say, for clustering) between short text segments is a challenging problem, because simple document representations such as TF-IDF suffer from the short sizes of the documents. Even semantic representations, such as word embeddings, suffer from similar problems when constructed from a corpus of short documents. De Boom et al. (2016) propose a specialized representation learning method for short texts that creates a more semantic representation by weighing word embeddings created from Wikipedia and Twitter. A common way to measure similarity between two texts is to measure the similarity between their mean vectors, where the mean vector of a document is the mean of the embeddings of the words in that text. This simple approach has several limitations, and Kenter and de Rijke (2015) propose a way to address some of them by creating bins of the dimensions and measuring similarity over these bins.

3 Learning Algorithms

3.1 Weakly Supervised Model Learning from News

Our approach is two-fold. In the first, learning phase, we learn a one-class classification model for identifying documents of class + 1 (disaster-reporting news), as against any other kind of document (class = − 1). The learnt model is in the form of a word set W, consisting of words which characterize only the class + 1. We are not interested in characterizing class − 1, and in that sense this is a one-class classification problem. The model W is learnt in a weakly supervised manner: the only “help” given to the model learning algorithm is a small labeled seed set D, where each document in D is labeled with class = + 1 (i.e., each document is known to be related to a disaster), and a small set W0 of known seed words which partially “characterize” class + 1 (i.e., are related to disasters). In addition, a large corpus U of unlabeled documents is given, i.e., none of the documents in U is labeled with any class label (either + 1 or − 1). Since the text in news is significantly different from that in tweets, we need to transfer this model to learn to classify tweets; that part is discussed in Section 3.2. Finally, in the prediction (operation) phase, we use the final model (i.e., the news domain model transferred to the tweets domain) to dynamically classify any incoming tweet as having class = + 1 or not.

We represent each input document Ui as a word tuple sequence (WTS) σi, which is an ordered sequence of word tuples: \(\sigma _{i} = \langle wt_{1}, wt_{2}, \ldots , wt_{N_{i}}\rangle \), where Ni is the number of tokens in the document Ui and each wtj, 1 ≤ j ≤ Ni, is a word tuple. Each word tuple has the form wtj = (wj, tj, cj, fj), where wj is a word token, tj is its POS tag (using the Stanford POS tagger), cj is its term frequency, i.e., the number of times this token occurs in the current input document irrespective of its POS tag, and fj is its document frequency, i.e., the number of documents in the entire corpus U in which this token occurs irrespective of its POS tag. Thus each word tuple is essentially a feature vector for a word token in the document. Word tokens are considered after stop-word removal and stemming. We insert a dummy word tuple to mark sentence boundaries. Note that if a word w occurs multiple times in the same document, σi will contain multiple word tuples corresponding to w; these word tuples may differ in the POS tag component, but otherwise they are identical. If N = |U| denotes the total number of documents in the corpus U, then one way to compute the TFIDF for a word wj is \(c_{j} \cdot log\frac {N}{f_{j}}\).
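To make this representation concrete, the following Python sketch builds a word tuple sequence and the TFIDF scores for one document. It assumes a generic tokenizer and POS tagger rather than the Stanford tagger used here; names such as build_wts are illustrative, not part of our implementation.

```python
import math
from collections import Counter, namedtuple

# A word tuple (w, t, c, f): token, POS tag, term frequency, document frequency.
WordTuple = namedtuple("WordTuple", ["word", "tag", "tf", "df"])

def build_wts(tagged_doc, doc_freq, num_docs):
    """Build the word tuple sequence (WTS) for one tokenized, POS-tagged document.

    tagged_doc : list of (token, pos_tag) pairs, after stop-word removal and stemming
    doc_freq   : dict mapping token -> number of corpus documents containing it
    num_docs   : total number of documents N = |U|
    """
    tf = Counter(tok for tok, _ in tagged_doc)          # term frequency per token
    wts = [WordTuple(tok, tag, tf[tok], doc_freq.get(tok, 1))
           for tok, tag in tagged_doc]
    # One way to compute TFIDF, as in the text: c_j * log(N / f_j)
    tfidf = {wt.word: wt.tf * math.log(num_docs / wt.df) for wt in wts}
    return wts, tfidf
```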

The simplest way to select a set of words characterizing disasters from a corpus U would be based on TFIDF. Unfortunately, since the documents in U are not labeled, this method selects all sorts of words, very few of which are disaster related. Hence we need a different method. The algorithm learn_disaster_model (Fig. 1) initializes the model W with the given set W0 of characteristic keywords, plus words that frequently occur “around” words in W0 in the known disaster reporting documents D. Initially, all documents in U are marked as − 1. The algorithm iteratively examines each document in U (among those which are still marked as − 1) and checks whether the label for this document can be changed to + 1, as follows. If the set W1 of frequently occurring words in this document does not have a significant overlap with the current model W, then this document is ignored for now, i.e., its label remains − 1. If the set W1 does have a significant overlap with the current model W, then this document is marked as + 1 and is never considered again. But before going to the next document, the algorithm selects those words (if any) from W1 which occur in WordNet but whose corpus count in WordNet is not “too high”, and adds them to the current model. After finishing the examination of all documents in U, the algorithm continues to the next iteration, because the labels of some more documents may now change if new words were added to W in the previous iteration. The algorithm stops after a user-specified number of iterations or if no words were added to W in the previous iteration.

Fig. 1

Algorithm to learn the disaster word model
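A simplified Python sketch of this iterative labeling loop follows. The thresholds and helper functions (overlap test, frequent_words, wordnet_corpus_count, get_context_words) are illustrative stand-ins for the details given in Fig. 1, not a definitive implementation.

```python
def learn_disaster_model(W0, D, U, theta0, max_iter, max_wordnet_count,
                         frequent_words, wordnet_corpus_count, get_context_words):
    """Sketch of the weakly supervised model learning loop (cf. Fig. 1).

    W0, D, U : seed words, seed disaster documents, unlabeled documents {doc_id: doc}
    frequent_words(doc)      -> set of frequently occurring nouns/verbs in doc
    wordnet_corpus_count(w)  -> WordNet corpus count of w (None if absent)
    get_context_words(W0, D) -> words occurring near seed words in the seed documents
    """
    W = set(W0) | get_context_words(W0, D)
    labels = {doc_id: -1 for doc_id in U}       # initially everything is -1
    for _ in range(max_iter):
        added = False
        for doc_id, doc in U.items():
            if labels[doc_id] == +1:
                continue                        # already accepted, never revisited
            W1 = frequent_words(doc)
            # One plausible overlap test (Jaccard); Fig. 1 specifies the exact criterion.
            if len(W & W1) / max(len(W | W1), 1) < theta0:
                continue                        # insufficient overlap: stays -1 for now
            labels[doc_id] = +1
            for w in W1 - W:                    # add words present in WordNet,
                count = wordnet_corpus_count(w) # but not "too common" there
                if count is not None and count <= max_wordnet_count:
                    W.add(w)
                    added = True
        if not added:
            break
    return W, labels
```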

The subroutine GetContextWords(W0, nD, window, S) works as follows. For every word w in the given seed list W0, compute the set X of all words (nouns or verbs only) which occur before or after w within a window of the given size, in any sentence in the given set of documents S. For each word x in X, find the number of documents in S in which x occurs and remove x from X if this frequency is less than the given threshold nD. So far, the output model is just a set of words. We later present another algorithm that “learns” a weight for each word wi in the model. Alternatively, for the weight of a word wi in the model, we could use its TFIDF score, or the conditional probability P(wi|class = + 1).
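A minimal sketch of GetContextWords is given below, assuming each document is provided as a list of POS-tagged sentences; the exact tag set and preprocessing are as described above.

```python
from collections import defaultdict

def get_context_words(W0, nD, window, S):
    """Return nouns/verbs occurring within `window` tokens of a seed word,
    in at least nD documents of S.

    S : list of documents; each document is a list of sentences,
        each sentence a list of (token, pos_tag) pairs.
    """
    doc_count = defaultdict(set)            # candidate word -> documents containing it
    for doc_id, doc in enumerate(S):
        for sentence in doc:
            for i, (tok, _) in enumerate(sentence):
                if tok not in W0:
                    continue
                lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
                for w, t in sentence[lo:hi]:
                    if t.startswith(("NN", "VB")) and w not in W0:
                        doc_count[w].add(doc_id)
    return {w for w, docs in doc_count.items() if len(docs) >= nD}
```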

3.2 Model Transfer Algorithm

Suppose we have a model Wnews learned from news. The simplest approach would use the model learned from the news corpus as it is on tweets, to predict which tweets are related to disasters. We will show in Section 5.2 that this approach has lower accuracy, because news and tweets are different types of documents in terms of vocabulary and style of writing. We need to refine the model Wnews, by removing and adding words, to construct a new model Wtweet (we focus on adding new words only). We propose a new transfer learning algorithm, augment_model (Fig. 2), to augment the model from a source domain (e.g., news) by examining an unlabeled corpus from the target domain (e.g., tweets). For this, we assume that we have an unlabeled corpus of tweets available (dataset D4).

Fig. 2
figure 2

Transfer learning algorithm to augment a source model

A new word is added to the model if it co-occurs with “sufficient” frequency with a word in Wnews. Pointwise mutual information (PMI) between two words u and v is defined as \(PMI(u, v) = log \left (\frac {p(u, v)}{p(u)p(v)} \right )\), where p(u, v) is the probability of co-occurrence of u and v, and p(u) and p(v) are the marginal probabilities of u and v, respectively. Out of several available alternatives, we use PMI as the measure of similarity between a word in the model and any other word in the unlabeled corpus. We select the top N0 words (we used N0 = 25) having the highest PMI with any word in the model Wnews and remove words that are not present in WordNet (unless they begin with #) or are named entities (person, location, organization, etc.). We add the remaining words from this list to Wnews to get Wtweet.
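The sketch below illustrates this PMI-based augmentation using tweet-level co-occurrence counts; the WordNet, hashtag and named-entity filters are omitted for brevity, and the function name augment_model is used only to mirror Fig. 2.

```python
import math
from collections import Counter
from itertools import combinations

def augment_model(W_news, tweets, N0=25, min_count=5):
    """Add to W_news the N0 words with the highest PMI to any model word.

    tweets : list of token lists from the unlabeled target-domain corpus.
    """
    word_count, pair_count = Counter(), Counter()
    for tokens in tweets:
        vocab = set(tokens)
        word_count.update(vocab)
        pair_count.update(frozenset(p) for p in combinations(sorted(vocab), 2))
    n = len(tweets)

    def pmi(u, v):
        joint = pair_count[frozenset((u, v))]
        if joint < min_count:                 # require "sufficient" co-occurrence
            return float("-inf")
        return math.log((joint / n) / ((word_count[u] / n) * (word_count[v] / n)))

    candidates = {w for w in word_count if w not in W_news}
    best = {w: max(pmi(w, m) for m in W_news) for w in candidates}
    top = sorted(best, key=best.get, reverse=True)[:N0]
    return set(W_news) | set(top)
```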

One issue with tweets is the prevalence of informally written words. We have found that most of the important disaster related words are not written informally in a tweet. Also, informally written words in a tweet (such as plz, pls, lol, thx, etc.) do not tend to enter the model, for the following reason: while we do remove some stopwords from tweets, we use the presence of a word in WordNet as a major constraint on model words, which prevents such informal words from creeping into the model.

3.3 Model-based Classification

Let a given document (a news item or a tweet) d contain words (nouns or verbs) Wd = {v1, v2,…, vk}. We have a simple and efficient real-time algorithm identify_disaster_tweet that uses the given model W to predict the class label for any given document d. Basically, if the similarity between the set Wd and the given model W is more than a user-specified threshold 𝜃1, the algorithm predicts class = + 1 (disaster related); otherwise it predicts class = − 1 (not disaster related). We use the Jaccard similarity between W and Wd, which is just \(\frac {|W \cap W_{d}|}{|W \cup W_{d}|}\).

We have found that disaster related documents sometimes look similar to those related to crime, accidents, war, terrorism or weather forecasts. To reduce this confusion, we can use the corpus to create a separate model for each of these classes of documents. We then modify our model-based classification algorithm to use these negative models as follows: if the similarity between the set Wd and any negative model Wneg is more than a user-specified threshold 𝜃2, then predict class = − 1 for d. Only if d is not similar to any of the negative models do we use the previous rule to predict whether d is disaster related or not.
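The full prediction rule, including the optional negative-model check, can be sketched as follows; the function signature is illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    return len(a & b) / len(a | b) if a | b else 0.0

def identify_disaster_tweet(doc_words, W, theta1, neg_models=None, theta2=None):
    """Predict +1 (disaster related) or -1 for a document.

    doc_words  : set of nouns/verbs in the document (Wd)
    W          : disaster word model
    neg_models : optional list of negative word models (e.g. Accident, Crime)
    """
    if neg_models and theta2 is not None:
        if any(jaccard(doc_words, W_neg) > theta2 for W_neg in neg_models):
            return -1                      # document matches a confounding class
    return +1 if jaccard(doc_words, W) > theta1 else -1
```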

4 Baseline Methods

We have created some baseline methods to compare our approach with. Starting with a given set W0 of “seed” keywords characterizing disasters, the algorithm wordset_expansion (Fig. 3) detects and adds other words (only nouns or verbs) in a given unlabeled corpus, which are very similar to those in W0 i.e., it creates a single “cluster” of words, starting with a set of cluster prototype or representative words. The algorithm does not use the set of known disaster documents, nor does it impose any restrictions on the frequencies of the words to be added. The cosine similarity uses the word embeddings produced by GloVe (Pennington et al. 2014).

Fig. 3

Algorithm to expand a set of given seed words
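A minimal sketch of this wordset_expansion baseline is given below. It assumes GloVe vectors have already been loaded into a dict, and it simplifies the stopping criterion of Fig. 3 to a single pass with a cosine-similarity threshold.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def wordset_expansion(W0, candidate_words, embeddings, sim_threshold=0.6):
    """Expand the seed set W0 with candidate nouns/verbs whose GloVe vector is
    close to that of some seed word.

    embeddings : dict mapping word -> numpy vector (e.g. loaded from GloVe files)
    """
    W = set(W0)
    for w in candidate_words:
        if w in W or w not in embeddings:
            continue
        if any(s in embeddings and
               cosine(embeddings[w], embeddings[s]) >= sim_threshold
               for s in W0):
            W.add(w)
    return W
```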

We designed a second baseline method which uses topic modeling (topic_based). We extracted 300 topics using the mallet toolkit (McCallum 2002) on dataset D1 and manually labeled the topics as disaster-related or not. We found 7 topics (out of 300) to be related to disasters. For example, the following are some words in one of the topics: earthquake, ocean, warning, quake, tsunami, tremor, magnitude, strike, damage. For any given document, mallet gives a topic distribution within that document. We used a simple classification rule: if the most frequent topic within a document D is one of the disaster related topics, then label D as + 1, else as − 1. We use this topic-based classification scheme as a baseline because it is also weakly supervised, like our approach.

We used the semi-supervised classification method of Transductive SVM (Joachims 1999) as the third baseline; we used the SVMlight tool to train a Transductive SVM using dataset D1.

Our fourth baseline algorithm is a self-training algorithm NB_iterative, which takes the same set of unlabeled and positive examples as given to learn_disaster_model, along with a small set NegSet of known negative examples. In each iteration, it trains a simple Naive Bayes classifier on the current sets of positive and negative examples, and predicts a class label Ld for each document d ∈ U with confidence c. If Ld = + 1 and c is “sufficiently high”, then it adds d to D and removes it from U; else if Ld = − 1 and c is “sufficiently high”, then it adds d to NegSet and removes it from U; otherwise, d remains in U without any class label. After a specified number of iterations, the final Naive Bayes model is tested. This algorithm is also weakly supervised and it does not use the user-specified set of seed keywords.
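The NB_iterative baseline can be sketched as below; this version uses scikit-learn's MultinomialNB and a fixed confidence threshold as illustrative stand-ins for the classifier and thresholds actually used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def nb_iterative(pos_texts, neg_texts, unlabeled_texts, conf=0.95, n_iter=10):
    """Self-training sketch: move confidently labeled documents from the
    unlabeled pool into the training set, then retrain."""
    pos, neg, unlabeled = list(pos_texts), list(neg_texts), list(unlabeled_texts)
    vec, clf = CountVectorizer(stop_words="english"), MultinomialNB()
    for _ in range(n_iter):
        X = vec.fit_transform(pos + neg)
        y = [1] * len(pos) + [0] * len(neg)
        clf.fit(X, y)
        if not unlabeled:
            break
        probs = clf.predict_proba(vec.transform(unlabeled))
        remaining = []
        for text, p in zip(unlabeled, probs):
            if p[1] >= conf:
                pos.append(text)            # confident positive
            elif p[0] >= conf:
                neg.append(text)            # confident negative
            else:
                remaining.append(text)      # stays unlabeled
        unlabeled = remaining
    return vec, clf
```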

5 Experimental Studies

5.1 Datasets

Training News Dataset D1

This dataset contains 9983 documents. Out of these, 10 earthquake related documents are explicitly labeled as + 1 (DISASTER-RELATED) and are used as seeds. Among the 9973 remaining unlabeled documents, we know that there are 50 documents related to other disasters (e.g., hurricanes and floods) and 60 each related to crime, accidents and weather, although this information is not passed to the disaster model learning algorithm. The remaining 9743 unlabeled documents are randomly selected news items from the FIRE corpus, some of which may be related to disasters, crime, accidents or weather, but we do not know which ones. The FIRE (Forum for Information Retrieval Evaluation) corpus contains 392,577 English news items from Indian newspapers such as The Telegraph and BDNews.

Test News Dataset D3

This fully labeled dataset consists of 2537 news items, out of which 162 are labeled + 1 (they are related to several natural disasters) and 2375 are labeled − 1. Among the latter, 2225 news items (labeled − 1) are from the BBC news website (Greene and Cunningham 2006) corresponding to stories in five topical areas (business, entertainment, politics, sports and technology) from 2004-2005, and 50 each are news related to crime, accidents and weather.

Labeled Tweets Dataset D2

This dataset contains 1344 disaster related tweets (labeled + 1) and 2696 non-disaster tweets (labeled − 1); among the latter, 100 tweets are related to crime, 100 to accidents and 100 to weather. We labeled the tweets manually. The tweets labeled + 1 relate to a variety of natural disasters, such as avalanches, cyclones, droughts, floods, forest fires, landslides, tsunamis and volcanoes, as well as nuclear accidents and biological disasters.

Unlabeled Tweets Dataset D4

Using the Twitter Streaming API, we downloaded a corpus of 7,555,000 English language tweets from 12 to 19 September 2016. All tweets are unlabeled.

5.2 Results

We trained the algorithm learn_disaster_model on dataset D1. We started with the following seed words: disaster, die, death, ruin, earthquake, quake, avalanche, landslide, cyclone, famine, flood, forestfire, fire, tsunami, volcano, H1N1, flu, ebola, epidemic, outbreak, radiation, nuclear. We fixed the values of the parameters as follows: n0 = 2, n1 = 4, nD = 5, ninv = 10000, cnoun = 6, cverb = 39, window = 10, 𝜃0 = 0.085, 𝜃1 = 0.025. This specific model (M1) contained 40 words. Some of the words in the model (not present in the seed words) were: aftershock, tremor, magnitude, rain, storm, damage, kill, collapse. We used this model to predict disaster related tweets on dataset D2, which gave F = 0.703 (entry M1 in Table 2).

Table 2 Experimental results on dataset D2

So far, these results do not use our transfer learning algorithm. Next, we started with the model M1 (as created by learn_disaster_model on D1) and used our transfer learning algorithm augment_model on D4, which led to the addition of these words to the model: erupt, lava, #prayforkorea. We tested this augmented (transferred) model on D2 (with 𝜃1 = 0.025), which gave a higher F-measure of 0.734, indicating the advantage provided by the transfer learning even in the unsupervised setting (entry M2 in Table 2).

Note that the model M2 does not use any negative models, such as the one for Accident. Hence, we started with an initial seed list of 29 words for Accident (e.g., accident, crash, wreck, collide, sink, drown, injure, die, capsize) and trained on D1 with 𝜃0 = 0.07, which resulted in a new model for Accident containing 38 words. Then we transferred this model to the tweets domain, using dataset D4, which resulted in a model containing 43 words. Finally, we used the model M2, along with this negative model for Accident with 𝜃1 = 0.07, to modify the predictions of M2 on dataset D2 as described earlier (entry M2b in Table 2). Since the model M2b performs better than M2, albeit only slightly, this validates our proposition that negative models have the potential to improve prediction accuracy by reducing false positives of the Disaster model. As an example, the model M2 classifies the tweet “pakistan train crash deaths and injuries reported collision kills at least six people and injures more than” as class + 1 (Disaster), but the model for Accident correctly recognizes this tweet as belonging to the Accident class, and hence it is not classified as Disaster by the model M2b. Finally, Table 2 also shows the results of predicting disaster related tweets on D2 using the various baselines discussed earlier.

6 Online Modification of Word Model

6.1 Online Learning of Weights of Words

The word model obtained after model transfer from the news domain to the tweets domain is static in terms of words, and no weight (i.e., importance) is associated with the words in the model. There is a need for an online learning algorithm that can dynamically adapt and modify this model, to cope with the wide variety of text in incoming real-life tweets, far wider than the limited corpus from which the word model was learnt. Changes to the word model are of three kinds: adding new words, removing older words, or changing the weights of the words already in the model. In this section, we only examine the problem of dynamically adapting the weights of the words in the model.

We define a weighted word model as a set of words with a real number as a weight for each word: Mwt = {u1 : w1, u2 : w2,…, un : wn}; here, wi ∈ R, and can be positive, negative or even 0. Note that the model learnt using the previous algorithm is not a weighted word model, i.e., it has no weights attached to any word. So we need to first convert this model to a weighted word model. And then we need to adapt this initial weighted word model to cope with the variations in the text in the incoming tweet stream. We use a common algorithm (algorithm learn_weights_online) for both these purposes (Fig. 4), which is a slightly modified version of the perceptron algorithm.

Fig. 4

Algorithm to learn weights of model words online

The algorithm learn_weights_online takes a weighted word model Mwt as input. It examines tweets in an incoming stream in a sequential manner: if the incoming tweet has a true label, then it modifies the weights of words in the model (explained shortly); otherwise it leaves the model unchanged. For every incoming tweet t (which has a true label y), the algorithm computes S, the intersection of t and Mwt. It then computes the sum of the weights of the words in S, as per Mwt. If this sum is positive then the predicted label \(\hat {y}\) for t is + 1, else it is 0. If the predicted label does not match the true label of t (i.e., \(y \neq \hat {y}\)), then the weights of only the words in S are modified, by the learning rate γ (γ is a constant given as an input to the algorithm). If the weight of a word in Mwt becomes negative, it indicates a word whose presence provides some evidence of the tweet being a non-disaster tweet.
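The sketch below illustrates this perceptron-style update for a stream of labeled tweets (labels in {0, + 1}); it includes the bias term mentioned below and updates the model in place. The exact update in Fig. 4 may differ in details.

```python
def learn_weights_online(M_wt, labeled_tweets, gamma=0.001):
    """Online weight learning for a weighted word model (sketch of Fig. 4).

    M_wt           : dict mapping model word -> weight (modified in place)
    labeled_tweets : iterable of (set_of_tweet_words, true_label), labels in {0, 1}
    """
    bias = 0.0                                   # extra feature capturing the intercept
    for words, y in labeled_tweets:
        S = words & M_wt.keys()                  # model words present in the tweet
        score = sum(M_wt[w] for w in S) + bias
        y_hat = 1 if score > 0 else 0
        if y_hat != y:                           # misclassified: nudge weights of S only
            direction = y - y_hat                # +1 or -1
            for w in S:
                M_wt[w] += direction * gamma
            bias += direction * gamma
    return M_wt, bias
```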

Let M2 denote the word model learnt using the model transfer algorithm. Since words in this model have no weights, we assume that each word in it has weight 0. We pass this initial weighted word model (with all weights 0) as input to the algorithm learn_weights_online, along with a finite dataset D5 of labeled tweets, where the ground truth label for each tweet is the one predicted by M2 (along with the negative models for Accident and Crime). Then the output of this algorithm is a new weighted word model M3, in which each word has a sensible weight. In this case, each iteration of the algorithm processes all tweets in D5, and the algorithm stops either after a maximum number of iterations is reached or when there are very few weight updates. Note that the algorithm includes an extra feature to capture bias i.e., the intercept.

In the next stage, we use the same algorithm in the true online setting, where the weights of the words in M3 keep getting updated whenever it receives a tweet with a ground-truth label. We could use some active learning strategy (such as uncertainty sampling) to pick tweets for labeling by the user at run time.

6.2 Baselines

For experimental validation, we will use the new weighted word model (modified using the algorithm learn_weights_online) to predict the class label for each tweet in a test dataset. For comparison, we have created several baseline methods, which also produce a class label for each tweet in our test dataset.

The first baseline method wordpair_similarity computes the pairwise cosine similarity between each word in a tweet and each word in the model M2, using the word embeddings from GloVe (Pennington et al. 2014). It then selects the top 5 word pairs, computes the average of their cosine similarities, and predicts class label + 1 if this average is above a given threshold and 0 otherwise. We tried several values of the threshold and report the best result.
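A minimal sketch of this baseline, again assuming GloVe vectors loaded into a dict:

```python
import numpy as np

def wordpair_similarity(tweet_words, model_words, embeddings, threshold, top_k=5):
    """Predict 1 if the average cosine similarity of the top-k
    (tweet word, model word) pairs exceeds the threshold, else 0."""
    sims = []
    for u in tweet_words:
        for v in model_words:
            if u in embeddings and v in embeddings:
                a, b = embeddings[u], embeddings[v]
                sims.append(float(np.dot(a, b) /
                                  (np.linalg.norm(a) * np.linalg.norm(b))))
    if not sims:
        return 0
    top = sorted(sims, reverse=True)[:top_k]
    return 1 if sum(top) / len(top) > threshold else 0
```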

We used Generalized Expectation Feature Labeling (GEFL) (Druck et al. 2008) as another baseline. This approach treats words in a document as features, and trains the maximum entropy document classifier with expectation constraints that specify affinities between words and class labels. For each word in our model M2, we provided a high probability value (0.9999) for that word belonging to the class + 1 and a low probability value (0.0001) for that word belonging to the class 0. We then trained the MaxEnt classifier using our labeled dataset, with these probability values as constraints. To be fair, note that our constraints refer to only the class + 1; we have not provided probabilities for any words outside the model (being too numerous). We are not using any standard classifier as a baseline, since we are working in an online setting.

As another baseline, we used the semi-supervised Label Propagation algorithm (LPA) (Zhu et al. 2003). We constructed a directed graph where each word in a corpus (only if present in WordNet) forms a vertex, and two words u, v are connected by two directed edges (u → v and v → u), each labeled with the pointwise mutual information (PMI) value between the two words computed from the corpus. We added an edge between a word pair only if the two words co-occurred in at least 5 tweets. We labeled the 40 vertices corresponding to the words from the model M2 as + 1, and we manually selected 40 non-disaster words and labeled the corresponding 40 vertices as − 1. We used the D5 corpus (explained below) to construct this graph, which had a total of 11,755 vertices and 211,444 edges.
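The propagation step on such a word graph can be sketched as follows; this version clamps the seed vertices and repeatedly averages neighbour scores, which is a simplification of the Zhu et al. (2003) update, and it assumes non-negative edge weights (e.g., positive PMI values).

```python
from collections import defaultdict

def label_propagation(edges, seed_labels, n_iter=100, eps=1e-5):
    """Propagate +1/-1 seed labels over a word graph.

    edges       : dict mapping (u, v) -> non-negative edge weight (e.g. positive PMI)
    seed_labels : dict mapping seed word -> +1 or -1 (these stay clamped)
    """
    neighbours = defaultdict(list)
    for (u, v), w in edges.items():
        neighbours[u].append((v, w))
    scores = {v: 0.0 for v in neighbours}
    scores.update(seed_labels)
    for _ in range(n_iter):
        max_change = 0.0
        for v, nbrs in neighbours.items():
            if v in seed_labels:
                continue                          # seed vertices stay clamped
            total = sum(w for _, w in nbrs)
            if total == 0:
                continue
            new = sum(scores.get(u, 0.0) * w for u, w in nbrs) / total
            max_change = max(max_change, abs(new - scores[v]))
            scores[v] = new
        if max_change < eps:
            break
    # Words whose final score is positive join the +1 (disaster) word set.
    return {v for v, s in scores.items() if s > 0}
```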

6.3 Dataset for Online Learning

Unlabeled Tweets Dataset D5

This is a subset of 755,427 tweets randomly selected from D4. D5 is used for learning the weights of words in the model M2 (phase 1 as discussed above). For each tweet t in D5, we produce a ground-truth label as follows: use model M2 to predict the label of t; if any of the negative models (say, for Accident) predicts label + 1 for t then we make the label of t as 0; else we keep the same label as predicted by M2. Thus D5 is now a labeled dataset (the labels are not ground truth, but are produced by our algorithms). Since D5 is only used for initializing the weights in the model, we have chosen not to use the entire D4 dataset for this purpose.

Labeled Tweets Dataset D6

This is a separate, manually labeled, dataset containing 2250 tweets, out of which 658 are disaster related tweets (labeled + 1) and 1592 are non-disaster tweets (labeled 0).

6.4 Experiments with Online Learning

As explained earlier, we start with the model M2 (in which each word has weight 0) and use the algorithm learn_weights_online on D5 to create a new weighted word model M3, with hyperparameter values γ = 0.001, max_iter = 200, min_error_count = 1. In M3, some of the words with the highest and lowest weights are: earthquake, damage, death, disaster, and #prayforkorea. Thus M3 is the same as M2, except that each word now has a weight.

Next, we use the algorithm learn_weights_online with D6 and the model M3 as input, with the same hyperparameter values as above, to learn another model M4, in which the weights of M3 are modified. We then use the model M4 to predict the class labels for the test dataset D2, using the same prediction rule as in the algorithm learn_weights_online. The baseline wordpair_similarity is used directly to predict class labels for tweets in D2 (we report the best result among various threshold values). We train the GEFL baseline on dataset D6 (as explained earlier) and use the trained MaxEnt classifier to predict the class labels for the tweets in the test dataset D2. We also prepared the LPA model using the Label Propagation algorithm, as explained earlier: running the LPA algorithm for 100 iterations with 𝜖 = 0.00001 on the graph constructed from the D5 corpus resulted in 192 vertices labeled + 1 (including the initial 40 vertices); we call this model LPA. We used this word model to classify the tweets, as explained earlier. Note that the performance of both M3 and M4 is better than that of all three baselines (see Table 3). Note also that M4 performs better than the models M1, M2 and M3, indicating that the weighted version of the word model works better; it also outperforms all the baselines. The big jump in accuracy from M3 to M4 justifies our step of learning the weights of the words in the model.

Table 3 Experimental results for Online Learning on dataset D2

So far, we have only tried to adapt the weights of the words in the model; we have not added any new words. We have also experimented with adding new words to the model, using the following strategy (sketched below). Suppose we get a tweet t with a known ground-truth class label y that is misclassified by the current model. Then we add to the current model all those words from t that are not currently in the model and that satisfy the constraints specified in learn_disaster_model, with initial weight \((y - \hat {y}) \cdot \gamma \). The weights of these newly added words then get updated (as per the new model’s predictions for subsequent tweets) just like any other word in the model. We start with the model M4 and simultaneously add new words to the model and modify the weights of existing words to get another model M4b. Unfortunately, we found that the word addition quickly diverges, and the model keeps getting larger and larger. We are looking for a strategy (e.g., based on word embeddings) which imposes a tight check on words before they are added to the model. One possibility is to use the similarity of a new word to the words already in the model (e.g., using word embeddings). Another possibility is to change the strategy for adding new words: add new words only if y = 1 and \(\hat {y} = 0\), i.e., focus on improving recall.
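The word-addition step just described can be sketched as follows; passes_model_constraints is an illustrative placeholder for the WordNet and POS constraints used in learn_disaster_model.

```python
def maybe_add_words(M_wt, tweet_words, y, y_hat, gamma, passes_model_constraints):
    """On a misclassified tweet, add its unseen words to the weighted model with
    initial weight (y - y_hat) * gamma, provided they pass the model constraints."""
    if y == y_hat:
        return M_wt                                  # correctly classified: no change
    for w in tweet_words:
        if w not in M_wt and passes_model_constraints(w):
            M_wt[w] = (y - y_hat) * gamma
    return M_wt
```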

7 Conclusions and Further Work

In this paper, we proposed a weakly supervised algorithm to learn a bag-of-words model for various disasters from a news corpus, and then proposed a simple model transfer algorithm that augments the news-based model using a corpus of unlabeled tweets. We then proposed a simple online learning method to enhance this model in a dynamic run-time setting. The proposed algorithms perform better than several baselines based on semi-supervised learning approaches. We also demonstrated the effectiveness of this approach on a completely unseen stream of tweets. The learned model, being simply a bag of words, is easy for humans to understand and modify.

We are planning to use this approach to detect other kinds of information, such as HIV/AIDS related tweets. We currently do not allow non-WordNet words to enter the model, which restricts the effectiveness of the model, because tweets often contain domain-specific words not present in WordNet, including hashtags (e.g., H1N1, forestfire, ebola). We need to lift this restriction for wider applicability of this technique. Another limitation of this approach is that the model structure is very simple; we are investigating more complex model structures to improve performance. For example, one could include phrases, rather than single words, in the model. We have developed techniques (based on the χ2-test and a test of proportions) to detect any decay in the importance of words in the model (e.g., hashtags) over time. The online learning setting here only modifies the word weights; we are exploring optimal ways to add new words or drop less useful words based on emerging vocabulary. We are also exploring other model modification mechanisms, as well as other online learning algorithms, to modify the model on the fly during operational deployment.