1 Introduction

The microblog search problem has attracted researchers’ interest. The emergence of smart mobile devices with integrated application modules has caused microblog data to explode on both traditional and mobile social network platforms. People share experiences, stories and funny content using short texts, pictures and brief videos through various microblog platforms [24]. Content posted on microblog platforms concerning security topics reflects people’s views on public events, especially disaster events. Social network platforms such as Twitter and Weibo are indispensable sources of microblog information. Public views on events, especially those concerning security, spread rapidly on these platforms. Searching microblogs for useful information about public events is becoming increasingly important because microblog information has become a vital source of news and life experiences.

Security topics in microblogs involve emergencies and disasters reported or discussed by users; otherwise, there is no difference between security topics and other topics from the standpoint of data features or statistical characteristics. However, from a social influence aspect, security topic information in microblogs spreads faster and wider. Disseminating security topic content on social networks can have wide-ranging social impacts [28, 47] that people with ulterior motives can capitalize on to create panic. Thus, searching microblog content appropriately for target topics based on user-perceived utility is a hotspot in the social network information search domain. The challenge in searching microblog content for a specific security topic is to find comprehensive sets of related content, which can be posed as an information search problem [38]. The term “user-perceived utility”, also used as “user-perceived quality”, meaning “user-oriented system behavior” in interactive systems [15], was proposed by Dolotta et al. [14]. We adopt this concept, combined with information retrieval evaluations, to assess the learning procedure of the proposed method.

Due to the limited length, casual expression and arbitrary writing characteristic of microblog messages, the sentiments of short texts posted on microblogs are often ambiguous [17]. These attributes distinguish microblog searches from traditional web searches. Traditional web searching and ranking operations rely on search engines that crawl and index web content efficiently. Search engines for everyday applications follow the ad hoc retrieval strategy [59] and aim to provide satisfying search results for users. The high degree of content aliasing and the prevalence of semantic noise in microblogs make applying traditional methods to microblog search and ranking problems infeasible. Thus, microblog search and ranking tasks face several challenges; meanwhile, the advent of big data and deep learning techniques provides research opportunities. Given these circumstances, deep learning-based methods such as the Deep Structured Semantic Model (DSSM) [23], Convolutional Latent Semantic Model (CLSM) [44, 45] and others have been used to solve big data information search problems. In addition, neural click-through models [5] have been proposed for web search problems.

For search problems, matching and ranking are the key components. Wang et al. [51] proposed a listwise approach for ranking-oriented collaborative filtering. Guo et al. [19] presented a deep relevance matching model for the ranking problem. Methods for rank learning combined with deep neural networks [30, 43] have been proposed to solve the ranking problem for big data searches. However, these search strategies are not suited to the data features of microblogs. The conventional methods of information search divide the search procedure into two main parts: matching and ranking. Ranking is usually modeled as a static process that relies on similarity metrics, and such methods over-rely on matching similarities.

In microblogs, the document form and limited content length pose extreme challenges for information retrieval [41]. Microblog search usually involves the latest news and events [12], especially for security-related topics that reflect complex social attributes. Existing studies have focused mainly on searches for relevant information based on semantic similarity discrimination combined with the spatiotemporal characteristics of microblogs [34]. One difficulty of microblog search is representing the contents accurately given microblog sparsity issues. Researchers have approached microblog search based on short text retrieval methods. Hasanain and Elsayed [20] studied query performance prediction for microblog search based on short texts to estimate the effectiveness of microblog search systems in the absence of feedback. Agarwal et al. [1] adopted keyword search to find contextual messages from the short-text data streams in microblogs. They revealed that microblog search based on short text retrieval tends to return only recent messages rather than the most relevant ones, which weakens the search utility perceived by users. Therefore, conducting microblog search while fully considering user-perceived utility is yet another challenge.

In this paper, we present a content search method for specific topics (MDPMS) based on deep reinforcement learning that is specifically targeted toward searching for security topics in microblogs. A novel and dynamic content relevance evaluation strategy based on the Deep Q Network (DQN) is proposed. The proposed DQN structure is composed of convolutional neural networks (CNN) and long short-term memory (LSTM) [22]. CNN is an efficient neural network architecture for learning multidimensional representations from data [48]. LSTM takes textual feature representations and outputs action-value calculations, which are used to dynamically evaluate the matching degree of diverse search results at the corresponding position during the content matching procedure. The proposed method follows a Markov decision process (MDP), in which we regard each subsequence of the search results as an individual state. Standard reinforcement learning methods for MDPs are applied to construct appropriate result sequences [35]. From an information search perspective, we calculate action values as the matching degrees for different ranking sequences. For complicated microblog data semantic features, we utilize the CNN to compact the local semantic features covered by specific words into low-dimensional semantic representations. The LSTM, which is the key component of the proposed DQN, is used to calculate referential action values for each microblog semantic feature from the convolutional feature representation sequences. The action values are used to select an action: whether to include the corresponding microblog content in the ranked search results list.

The proposed method provides several advantages: 1) it dynamically evaluates the microblog content search matching degree by gauging how well the microblog content accords with corresponding topics; 2) it presents a novel DQN framework composed of a CNN and an LSTM to calculate the action values; 3) it realizes raw microblog content mapping to the most appropriate ranked content list in an end-to-end manner.

For the first advantage, the matching degrees are calculated in accordance with the corresponding ranking results. The result sequences change in length and content when new microblog content is selected as a result at the corresponding position. Different result sequences form different search states at each time step. The evaluations are trained by reinforcement learning while the states change cyclically over episodes. For the second advantage, reinforcement learning is applied to optimize the parameters of the designed DQN constructed from the CNN text feature representations and the LSTM. The well-trained DQN structure processes the microblog content features and outputs the matching degrees at corresponding positions dynamically as the search state changes. Finally, existing methods calculate ranking scores directly for documents based on different queries [55] but ignore the user-perceived utility of information about different search states. Instead, the proposed method models the different action values as the user-perceived utility following reinforcement learning, which makes the search ranking task more intelligent.

The rest of this article is organized as follows. In Section 2, we review related works on microblog search and discuss deep reinforcement learning applications for information search. In Section 3, the fundamentals of the proposed method are presented. Section 4 provides the details of content evaluation based on the DQN. Comprehensive experiments on a real-world dataset are presented in Section 5, and Section 6 concludes this paper.

2 Related works

2.1 Microblog search

Microblog services have become increasingly popular since the advent of smart mobile devices [37]. In recent years, microblog retrieval [2, 7, 11] has attracted great attention from researchers worldwide. Zhang et al. [61] studied top-k disjunction query processing issues in microblog search. They proposed a compact indexing method named “judicious searching” for microblog search in a huge dataset. Basu et al. [3] proposed a microblog search method that used context-specific stemming to capture diverse microblog contents based on word embedding. Basu et al. [4] studied the issues of microblog search in disaster situations. Wang et al. [52] proposed a feedback concept model to solve the microblog retrieval problem using query expansion. The strategy of query expansion has also been adopted to solve real-time social network search problems. Zhang et al. [60] adopted a query-biased ranking model with a semi-supervised algorithm which they used to capture query characteristics. Xia et al. [56] solved the problem of complex query analysis by studying the top-k most significant temporal keywords.

Ranking is a key component of information search. Feng et al. [16] presented a reinforcement ranking model based on graphical emoticons as sentimental labels of microblog posts. Zhang et al. [62] presented a deep learning-based method to rank themes in microblog contents by calculating the similarities among visual features, text content and microblog popularity using a deep learning framework. Song et al. [49] provided a ranking algorithm for WeChat social network content by weighting the vector space of the entry position using document content matching technology. De Maio et al. [13] introduced a method to readapt the ranking of preferred tweets (those more likely to be interesting to users); the method builds a deep learning-based ranking model to re-rank tweets from the top-ranked list.

2.2 Deep reinforcement learning applications on information search

Recently, deep reinforcement learning has been applied in many research fields, e.g., artificial intelligence, intelligent control, and robotics. Mnih et al. [35] developed a DQN that acted as a novel artificial agent to learn policies from high-dimensional sensory inputs. The agent parameterized an action-value function to play Atari games at a human performance level. Mnih et al. [36] introduced an asynchronous gradient descent method for reinforcement learning to optimize deep neural network controllers under a lightweight framework. Silver et al. [46] proposed an algorithm combining a Monte Carlo simulation with policy networks to optimize reinforcement learning from self-playing games. The Markov decision process [31, 39] is the core model of reinforcement learning and is also employed to construct ranking models. Xia et al. [50] utilized a continuous Markov decision process to construct a diverse ranking model. The model adopted a policy gradient reinforcement algorithm [8] to adjust the parameters to maximize the expected long-term discounted rewards.

Reinforcement learning strategies that follow the Markov decision process have been deployed to solve information search problems. Diversifying search result rankings [58] is an important goal in information search tasks. Xia et al. [55] formalized the search ranking process as a set of sequential decision-making processes that were further modeled as a continuous-state Markov decision process. The method learns to make decisions about which policies are used to rank documents at the corresponding positions based on the search state. They used the policy gradient algorithm to train the method to earn an appropriate future reward. Keyhanipour et al. [25] applied reinforcement learning to rank aggregation. The method integrated data fusion and reinforcement learning algorithms such as Q learning [53] and SARSA [63] to obtain the best aggregated search result. Reinforcement learning frameworks are also used to address the challenges of high-dimensional training data. Click-through features were applied to reinforcement learning in [26]. Real-time web search [32] is another key aspect of information searching.

For large amounts of data, deep representations have been shown to be indispensable elements for deep learning applications [21]. The related works on deep reinforcement learning applications in information search show that the combination of deep representations and massive datasets provides opportunities for solving the problem of microblog searches for specific topics through deep reinforcement learning. Wei et al. [54] devised a learning-to-rank algorithm based on the Markov decision process to calculate the ranking evaluations of all positions in the result list. The algorithm adopted policy gradient, an on-policy reinforcement learning algorithm, to realize reinforcement for the information search. In this paper, we propose a microblog search method based on a deep Q network (DQN), which is an off-policy reinforcement learning algorithm. Different from policy gradient, DQN selects appropriate actions to optimize a policy while learning from samples generated by a different behavior policy, whereas the policy gradient algorithm generates samples with the same policy that is being updated [50].

3 The proposed MDPMS method

We model the microblog search task as a continuous-state MDP [60]. For a microblog search task related to a specific security topic, we have the query content q and a set of microblog contents M = {m1, m2, …, mn}. We obtain the final ranking sequence R = {r1, r2, …, rl} (l < n) through a supervised reinforcement learning procedure modeled by the MDP. Each microblog content to be searched is mapped into a latent semantic space (mn, rl ∈ Rd) with the help of word embedding techniques, where each content representation has a fixed length. The structure of the proposed method is illustrated in Figure 1.

Figure 1 The structure of the proposed microblog content-search method for security topics

The proposed method is primarily composed of a newly designed deep Q network (DQN). The DQN framework contains a pre-trained CNN framework to generate microblog content features and an LSTM with fully connected layers to generate action values. Microblog contents are selected or skipped as a search result based on the action values. The search result sequence states are recorded to provide rewards for training the proposed DQN. The details are presented in Sections 3.1 and 3.2. The proposed DQN is depicted and described in Section 4.

3.1 Formal definitions of the microblog searching and ranking procedures

Under MDP, microblog content searching and ranking for a specific security topic is formulated as a five-tuple <S, A, O, T, R>:

The state (S) represents the search status. Each ranking list corresponds to a distinct state because S varies as the search results change. Briefly, we define a state as a tuple of the microblog contents selected as the search results and the encoded user-perceived utility of those contents.

S is designed as S = {Dt, Pt} at time step t, where Dt is the result sequence of microblog contents selected during the preceding t time steps. Dt is defined as follows:

$$ \mathrm{D}_t = \langle \mathrm{m}_1, \mathrm{m}_2, \ldots, \mathrm{m}_t \rangle = \langle \mathrm{m}_{(n)} \rangle_{n=1}^{t}, \quad \mathrm{m}_{(n)} \in \mathbb{R}^{d} \tag{1} $$

where m(n) is the ranked microblog content at position n. Pt is the encoded user-perceived utility; like Dt, it is a variable-length sequence defined as follows:

$$ \mathrm{P}_t = \langle p_1, p_2, \ldots, p_t \rangle = \langle p_{(n)} \rangle_{n=1}^{t}, \quad p_{(n)} \in [-1, 1] \tag{2} $$

where p(n) is a scalar indicator of the user-perceived utility for the selected microblog content at the corresponding time step. In addition, m(n) and p(n) correspond to each other. At initialization, we define Dt = ∅, Pt = ∅ when t = 0.

Action (A) represents the actions to be selected for the corresponding microblog contents. For each microblog item, there are only two choices: choosing the content and including it in the results, or skipping the content. At time step t, we define A as Act = {c, k}. In the action set, c represents choosing the content for state St, while k means skipping the content. For state St at time step t, at = Act_DQN(St) determines whether the corresponding content mt + 1 is selected into the result list or skipped at the current ranking position. The chosen microblog content is denoted Cm(at = c). As presented in the definition of S, the value of Pt = DQN(Cm(at = c)) acts as the action value [53] calculated by the DQN (details are presented in the next section).
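To make the state and action definitions concrete, the following minimal Python sketch represents St = {Dt, Pt} and the two-element action set; the names (SearchState, CHOOSE, SKIP) are ours and purely illustrative.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

# Illustrative containers for the MDP state S_t = {D_t, P_t} and the
# action set Act = {c (choose), k (skip)} described above.
CHOOSE, SKIP = "c", "k"


@dataclass
class SearchState:
    """State at time step t: ranked contents D_t and their perceived utilities P_t."""
    D: List[np.ndarray] = field(default_factory=list)  # selected content vectors, each in R^d
    P: List[float] = field(default_factory=list)       # perceived utilities p_(n) in [-1, 1]

    def __len__(self) -> int:
        return len(self.D)  # number of results ranked so far


# At initialization (t = 0) both sequences are empty: D_0 = P_0 = {}.
s0 = SearchState()
assert len(s0) == 0
```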

Observation (O) is the observation of the current ranking environment. It is used to record the global status, including the state transformations and all the action choices. We define O as shown in Eq. (3):

$$ \mathrm{O} = \langle \{\mathrm{S}_0, a_0\}, \{\mathrm{S}_1, a_1\}, \ldots, \{\mathrm{S}_t, a_t\} \rangle = \langle \{\mathrm{S}_{(n)}, a_{(n)}\} \rangle_{n=1}^{t} \tag{3} $$

An observation is an expanded description of the state. The reinforcement procedure is reflected through observations.

A transition (T) is defined as the transformation of state St at time step t to the next state, St + 1, activated by the actions selected by the DQN with parameters θ. The transition is a function T: S × Act_DQN(S) → S, as shown in Eq. (4):

$$ \begin{aligned} \mathrm{S}_{t+1} &= \mathrm{T}\left(\mathrm{S}_t, \mathrm{Act\_DQN}(\mathrm{S}_t; \theta)\right) \\ &= \mathrm{T}\left(\langle \mathrm{D}_t, \mathrm{P}_t \rangle, a_t\right) \\ &= \begin{cases} \left\{ \langle \mathrm{D}_t \oplus \mathrm{C}_{m(a_t = c)} \rangle,\; \mathrm{P}_t \oplus \mathrm{DQN}\left(\mathrm{C}_{m(a_t = c)}\right) \right\} & \text{if } a_t = c \\ \langle \mathrm{D}_t, \mathrm{P}_t \rangle & \text{if } a_t = k \end{cases} \end{aligned} \tag{4} $$

where ⊕ denotes a concatenation of sequences and elements. In the expression Dt ⊕ Cm(at = c), ⊕ concatenates the sequence of Dt with the selected microblog content Cm(at = c) to form the new sequence of state St + 1 for the next time step.
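A minimal sketch of the transition in Eq. (4), reusing the SearchState sketch above and assuming a dqn_utility callable that returns the action value (perceived utility) of a candidate content; the function name is illustrative only.

```python
import numpy as np

def transition(state: SearchState, action: str, candidate: np.ndarray, dqn_utility) -> SearchState:
    """Eq. (4): append the chosen content and its DQN utility, or leave the state unchanged."""
    if action == CHOOSE:
        # D_{t+1} = D_t (+) C_m   and   P_{t+1} = P_t (+) DQN(C_m)
        return SearchState(D=state.D + [candidate],
                           P=state.P + [float(dqn_utility(candidate))])
    # action == SKIP: the result list and utilities are carried over as-is
    return SearchState(D=list(state.D), P=list(state.P))
```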

A reward (R) is the evaluation of the quality of the search results for a training episode. The user-perceived utility defined in the state is used to construct a reward in each episode. We adopt Normalized Discounted Cumulative Gain (NDCG) to formalize the reward representation. To acquire NDCG’s properties, the user-perceived utility Pt ∈ [−1, 1] at time step t is mapped to a relevance grade based on a domain segmentation (i.e., relt = switch_map(Pt)), where relt is a positive integer that represents a relevance grade. The function switch_map(·) maps Pt to the corresponding relevance grade based on the domain segmentation. The reward is defined as r_NDCG, as shown in Eq. (5), by applying the NDCG calculation:

$$ \begin{aligned} \mathrm{r\_NDCG}(P) &= \frac{\mathrm{switch\_map}(P_1) + \sum\limits_{i=2}^{|\mathrm{P}_t|} \frac{\mathrm{switch\_map}(P_i)}{\log_2(i+1)}}{\sum\limits_{i=1}^{|\mathrm{P}_t|} \frac{2^{\mathrm{switch\_map}(P_i)} - 1}{\log_2(i+1)}} = \frac{rel_1 + \sum\limits_{i=2}^{|\mathrm{P}_t|} \frac{rel_i}{\log_2(i+1)}}{\sum\limits_{i=1}^{|\mathrm{P}_t|} \frac{2^{rel_i} - 1}{\log_2(i+1)}} \\ &\iff \frac{\sum\limits_{i=1}^{p} \frac{2^{\mathrm{switch\_map}(P_i)} - 1}{\log_2(i+1)}}{\sum\limits_{i=1}^{|\mathrm{P}_t|} \frac{2^{\mathrm{switch\_map}(P_i)} - 1}{\log_2(i+1)}} = \frac{\sum\limits_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}}{\sum\limits_{i=1}^{|\mathrm{P}_t|} \frac{2^{rel_i} - 1}{\log_2(i+1)}} \end{aligned} \tag{5} $$

The reinforcement learning algorithm optimizes the model parameters under a supervised learning strategy using click-through labeled data of microblog contents. In this paper, the proposed method achieves reinforcement mainly by relying on continuous changes in the reward.
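A hedged sketch of the NDCG-based reward in Eq. (5) follows. The even segmentation of [−1, 1] into integer grades in switch_map and the ideally ordered denominator are our reading of the equation, not the paper's exact implementation.

```python
import math
from typing import List

def switch_map(p: float, grades: int = 4) -> int:
    """Map a perceived utility p in [-1, 1] to an integer relevance grade 1..grades
    by evenly segmenting the domain (an assumed segmentation)."""
    p = max(-1.0, min(1.0, p))
    return 1 + round((p + 1.0) / 2.0 * (grades - 1))

def r_ndcg(P: List[float]) -> float:
    """NDCG-style reward over the utilities of the current result list, following Eq. (5)."""
    rel = [switch_map(p) for p in P]
    dcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rel))
    idcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(sorted(rel, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0

print(round(r_ndcg([0.9, 0.2, -0.4]), 3))  # reward for a three-item result list
```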

3.2 Reinforcement learning for microblog search

In this paper, we use an off-policy strategy to design a learning algorithm for searching target topic content from microblogs. Due to the complicated microblog data characteristics that distinguish such data from traditional web data, we construct a combined DQN model to process the semantic feature representations and the action-value relevance evaluations. The demand to learn better features than handcrafted ones motivated us to connect reinforcement learning algorithms to deep learning methods, which are applied to operate directly on high-dimensional microblog semantic features.

Inspired by reinforcement learning [50], the algorithm for the proposed method is shown in Algorithm 1. As an off-policy strategy, the proposed method follows deep Q learning with experience replay [35], interacting with the microblog contents with click-through labels (i.e., {<d1, L1>, <d2, L2>, …, <dl, Ll>}), where l is the number of microblog contents to be searched.

Setting the immediate reward for the current time step is an essential component of Algorithm 1. For each time step, the result list is rebuilt with the selected content. The immediate reward is calculated by Eq. (5), while the final reward of the whole procedure is given in Step 16 of Algorithm 1. We set rt = 0 at initialization (t = 0), where rt represents the final reward of time step t. Generating action-choosing scores, known as “action values” and expressed as probability distributions over the actions, is a vital step in realizing microblog searching and ranking during the MDP. The target of the algorithm is to learn parameters for the DQN to generate action values as user-perceived utilities to select related microblog contents. Algorithm 1 is the key to learning an intelligent ranking model for microblog searching. For learning purposes, we adopt the mean-square error function as the loss function, as shown in Eq. (6), following the standard reinforcement learning process:

$$ \begin{aligned} L(\theta) &= \mathrm{E}\left[\left(\mathrm{Q}^{*}(S;\theta) - \mathrm{Q}(S;\theta)\right)^{2}\right] \\ &= \mathrm{E}\left[\left(r + \gamma \max_{a} \mathrm{Act\_DQN}(S;\theta) - \mathrm{Q}(S;\theta)\right)^{2}\right] \end{aligned} \tag{6} $$
Algorithm 1

The mean-square error loss function evaluates the mathematical expectation of the deviations between the target values and the actual values. In the loss function, the target is Q*(S;θ), which is represented as rt + γ r_NDCG(Pt ⊕ Pt + 1). In this expression, γ is the discount factor controlling the target value change. Q(S;θ) outputs the current estimated values, while the target expression is used to estimate the action-value function.

The DQN is an action-value generator. The action values are used as the user-perceived utilities. Another key function of DQN is to analyze the semantic features of different microblog contents and queries for specific topic content. The proposed DQN combined with CNN and LSTM is described in Section 4.
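The following compact sketch illustrates an experience-replay update of the form in Eq. (6). The buffer layout, batch size and the assumption that the Q network exposes two action values (choose/skip) are ours; the network itself is sketched in Section 4.

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

def replay_update(q_net, optimizer, replay_buffer: deque,
                  batch_size: int = 32, gamma: float = 0.9):
    """One off-policy update: sample transitions (S_t, a_t, r_t, S_{t+1}) from the
    replay buffer and minimize the mean-square error of Eq. (6)."""
    if len(replay_buffer) < batch_size:
        return None
    batch = random.sample(replay_buffer, batch_size)
    states = torch.stack([b[0] for b in batch])                         # S_t
    actions = torch.tensor([b[1] for b in batch], dtype=torch.int64)    # 0 = choose, 1 = skip
    rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)  # r_t from Eq. (5)
    next_states = torch.stack([b[3] for b in batch])                    # S_{t+1}

    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(S; theta)
    with torch.no_grad():                                               # target r + gamma * max_a Q
        q_target = rewards + gamma * q_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_pred, q_target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```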

4 Selecting and evaluating relevant contents based on DQN

Existing information-search methods depend on handcrafted relevance features for searching and ranking [35]. For these methods, feature quality directly determines reliability. Furthermore, existing methods calculate relevance scores by relying on interaction measures between the queries and documents at separate and fixed positions. The calculation procedure is regarded as a static process that neglects the effect of the dynamically constructed sub-ranking results on the final ranking result. The search method presented in Section 3 is intended to solve that problem.

As a search problem, evaluating content to obtain relevance scores is indispensable, especially for microblog contents with complex data characteristics. In this section, we propose a DQN functional framework, as used in Eq. (4) and lines 9 and 13 of Algorithm 1, to process microblog semantic features in a semantic embedding space.

4.1 CNN and LSTM-based DQN to select relevant content

The typical microblog content characteristics are limited length, casual expression, and arbitrary writing. These features make searching microblog content for topics different from searching traditional web content because microblog content is mixed with large amounts of global semantic noise. To search microblog content for a specific topic, the contents are processed based on local semantic features expressed by representative words.

As a preprocessing step, word segmentation is conducted on the microblog contents and stop words are removed. In this process, the content of a microblog m is modeled as a multidimensional vector m = <w1, w2, …, wp> using word embedding, where wp ∈ Rwd is a fixed-dimensional word vector. We used a pretrained Word2vec model [33] to create the fixed-dimensional word vectors. The length of each microblog vector is also a fixed value that is greater than the number of words in the microblog content with stop words removed.
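A minimal preprocessing sketch under the assumptions above (pretrained word vectors, stop-word removal, zero-padding to a fixed length p); the stand-in vocabulary and the dimensions wd = 60, p = 100 from Section 5.2 are used only for illustration.

```python
import numpy as np

def embed_microblog(tokens, word_vectors, stop_words, p: int = 100, wd: int = 60) -> np.ndarray:
    """Map a tokenized microblog to a fixed-size p x wd matrix of word vectors,
    dropping stop words and zero-padding (or truncating) to length p."""
    kept = [t for t in tokens if t not in stop_words and t in word_vectors]
    mat = np.zeros((p, wd), dtype=np.float32)
    for i, tok in enumerate(kept[:p]):
        mat[i] = word_vectors[tok]
    return mat

# Toy usage with stand-in vectors; a pretrained Word2vec model would supply these.
vocab = {"tianjin": np.random.randn(60).astype(np.float32),
         "explosion": np.random.randn(60).astype(np.float32)}
m = embed_microblog(["tianjin", "explosion", "the"], vocab, stop_words={"the"})
print(m.shape)  # (100, 60)
```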

To learn the local semantic features of microblog contents, we utilize the CNN framework to perform convolutional and pooling calculations and generate compact representations. The LSTM framework with two fully connected layers is then applied to calculate the action value. Finally, the action value is used to evaluate the selected microblog content for the corresponding position in accordance with the query. The proposed DQN framework is shown in Figure 2.

Figure 2 The proposed DQN framework

As depicted in Figure 2, the CNN framework is deployed as the interface to extract the local semantic feature representations. The CNN is trained in pairwise fashion so that it is sensitive to the target topic content vectors. In the search process, the target topic contents are the corresponding queries from which the vectors are generated by the word embedding techniques. From a representation aspect, the query contents as the search target are preprocessed into the computable vectors at a uniform size in the same way as other microblog contents, as shown in Figure 2.

As stated earlier, the microblog contents are mapped into fixed-size vectors of wd × p × 1, where wd is the length of the word vectors and p is the number of word vectors composing a microblog vector. Similar to image processing, the convolutions operate on a region of the microblog vectors, followed by pooling conducted on the convolution result vectors. We perform the convolution computations at a size of ⌈wd/2⌉ × 1 (i.e., half the length of a word vector). The pooling layers have a size of ⌈wd/2 + 1⌉ × 2 and operate on the region of the convolution results. The convolutional layers and the pooling layers collaboratively calculate the semantic features of the microblog vectors. The convolutional feature representations, which involve both microblog content features and target topic content features, are generated by nonlinear transformations from the pooling result vectors. More computational details are presented in the following section. The compact feature representation sequences output by the CNN framework are input to the LSTM framework to calculate the action values. Hence, both the semantic and temporal dependencies of the content stream are considered in the LSTM. The LSTM evaluates whether the current content should be selected as a search result. The input sequence for the LSTM is composed of a series of contents that are evaluated to construct a continuous semantic and temporal state. The values output by the LSTM framework through the two fully connected layers are treated as the action values at the end stage of the input sequence. At initialization, the input sequence of the LSTM is padded with zero vectors; the sequence is filled by the accumulated content vectors over the time steps.
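The sketch below shows one plausible way to wire the CNN, LSTM and fully connected layers of Figure 2 together in PyTorch. The channel counts, projection size and hidden size are simplified placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MicroblogDQN(nn.Module):
    """Simplified CNN + LSTM DQN in the spirit of Figure 2 (all sizes are assumptions)."""
    def __init__(self, wd: int = 60, p: int = 100, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(1, 64, kernel_size=(wd // 2, 1))    # local semantic features
        self.pool = nn.MaxPool2d(kernel_size=(wd // 2 + 1, 2))    # aggregate and reduce
        self.project = nn.Linear(64 * (p // 2), feat_dim)         # compact representation R_cf
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)   # sequence of R_cf over time
        self.fc1 = nn.Linear(hidden, 64)
        self.fc2 = nn.Linear(64, 2)                               # action values: choose / skip

    def forward(self, content_seq: torch.Tensor) -> torch.Tensor:
        # content_seq: (batch, seq_len, wd, p) -- the accumulated content vectors
        b, t, wd, p = content_seq.shape
        x = content_seq.reshape(b * t, 1, wd, p)
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        x = torch.relu(self.project(x.reshape(b * t, -1)))
        out, _ = self.lstm(x.reshape(b, t, -1))
        h = torch.relu(self.fc1(out[:, -1]))                      # state at the last time step
        return torch.softmax(self.fc2(h), dim=-1)                 # action-value distribution

# Usage: a batch of two partial result sequences, each of length 3
values = MicroblogDQN()(torch.randn(2, 3, 60, 100))
print(values.shape)  # torch.Size([2, 2])
```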

4.2 Computational details of the DQN

The aim of the convolutional layer is to extract the local semantic features contained in the representative word vectors. The convolutional filter has a size of ⌈wd/2⌉ × 1 in the first two dimensions and 1024 channels in the third dimension. The convolution filter covers half a word vector during each computation step, constructing the convolution result vector at a size of ⌈(wd − ⌈wd/2⌉ + 1)/1⌉ × 1 × 1024, i.e., ⌈wd/2 + 1⌉ × 1 × 1024.

More formally, we define the convolution operation as * between the content vector Vc and the convolution filter F. Following Severyn’s work [45], the convolution operation is defined as shown in Eq. (7).

$$ \mathbf{S}_{\mathrm{cr}} = \sum_{i=0}^{i+\lceil wd/2 \rceil - 1} \sum_{j=0}^{j+p-1} \mathbf{V}_{\mathrm{c}\left[i:i+\left\lceil \frac{wd}{2} \right\rceil - 1,\; j:j+p-1\right]} * \mathbf{F} \tag{7} $$

where Scr is the convolution result vector. The convolution filter covers the content vector Vc at the range of [i:i + ⌈wd/2⌉-1, j:j + p-1] (from i to i + ⌈wd/2⌉-1 at the first dimension and from j to j + p-1 at the second dimension).

The convolution result vectors are passed to the activation function to generate the input of the pooling layer. The pooling layer aggregates the information and reduces the representation. The pooling operation is defined as follows:

$$ \mathbf{S}_{\mathrm{pr}} = \mathrm{max\_pool} \sum_{i=0}^{i+\lceil wd/2 \rceil} \sum_{j=0}^{j+p-2} \left( \mathrm{ReLU}\left( \mathbf{S}_{\mathrm{cr}\left[i:i+\left\lceil \frac{wd}{2} \right\rceil,\; j:j+p-2\right]} + \mathbf{b}_{ij} \right) \right) \tag{8} $$

where Spr is the pooling result vector. ReLU is used as the activation function and bij is the corresponding bias. We use max-pooling to realize the pooling operation at the range of [i:i + ⌈wd/2⌉, j:j + p-2] (from i to i + ⌈wd/2⌉ at the first dimension and from j to j + p-2 at the second dimension).

The pooling filter has a size of ⌈wd/2 + 1⌉ × 2 in the first two dimensions and 2048 channels in the third dimension. The filter slides over the convolution results and generates the result vector at a size of (⌈wd/2 + 1⌉ − ⌈wd/2 + 1⌉ + 1) × (p − 2 + 1) × 2048, or 1 × (p − 1) × 2048.

In the learning phase, we use a pairwise method to analyze the target contents together with the microblog contents so that the CNN becomes sensitive to the target representation vectors. The generated convolutional feature representations Rcf are associated with both the microblog content features and the target topic content features. The CNN utilizes its local perception property to process the embedded semantic vectors of microblog contents and target topic contents. This local perception property meets the special demands of microblog data characteristics, capturing local semantic features among large amounts of semantic noise. This supervised learning process generates the convolutional feature representations of microblog and target topic contents. To improve the coherence of the semantic information, an LSTM with two fully connected layers is used to analyze the large number of convolutional feature representations. The goal of the LSTM is to yield an appropriate action value to guide the reinforcement learning algorithm to make a proper action choice.

LSTM is a Recurrent Neural Network (RNN) structure whose blocks are composed of a cell, an input gate, an output gate and a forget gate. The cell obtains new input information when the input gate it is activated at time step t. The final state ht receives the latest cell output ct when the output gate ot is on. The forget gate ft is activated when the previous cell output ct-1 should be forgotten. Under this strategy, the gradient is trapped in the cell and prevented from vanishing too quickly [57]. The convolution features Rcf generated by the CNN are input into the LSTM in sequence. In this paper, we follow the formulation of Graves’ work [18] to present the model shown in Eq. (9).

$$ \begin{aligned} i_t &= \sigma\left(\mathbf{W}_{\mathrm{R}i}\mathbf{R}_{\mathrm{cf}\_t} + \mathbf{W}_{hi}h_{t-1} + \mathbf{W}_{ci} \circ c_{t-1} + \mathbf{b}_i\right) \\ f_t &= \sigma\left(\mathbf{W}_{\mathrm{R}f}\mathbf{R}_{\mathrm{cf}\_t} + \mathbf{W}_{hf}h_{t-1} + \mathbf{W}_{cf} \circ c_{t-1} + \mathbf{b}_f\right) \\ c_t &= f_t \circ c_{t-1} + i_t \circ \tanh\left(\mathbf{W}_{\mathrm{R}c}\mathbf{R}_{\mathrm{cf}\_t} + \mathbf{W}_{hc}h_{t-1} + \mathbf{b}_c\right) \\ o_t &= \sigma\left(\mathbf{W}_{\mathrm{R}o}\mathbf{R}_{\mathrm{cf}\_t} + \mathbf{W}_{ho}h_{t-1} + \mathbf{W}_{co} \circ c_t + \mathbf{b}_o\right) \\ h_t &= o_t \circ \tanh(c_t) \end{aligned} \tag{9} $$

where σ is the logistic sigmoid function, ∘ denotes the Hadamard product, and it, ft, ct, ot, and ht are the values of the input gate, forget gate, cell state, output gate and final state, respectively, at time step t. Rcf_t is the convolution feature input into the LSTM at time step t, and W and b are the weight and bias parameters of the corresponding LSTM components. The LSTM is connected to two fully connected layers. The output of the LSTM, lr, is input into these two layers to generate the LSTM features lfc. The action value is generated by a softmax as a value distribution over the actions, as formulated in Eq. (10).

$$ \begin{aligned} \mathbf{l}_{\mathrm{fc}} &= \mathrm{ReLU}\left(\mathbf{W}_{\mathrm{fc}}'\,\mathbf{l}_{\mathrm{r}} + \mathbf{b}_{\mathrm{fc}}'\right) \\ Action\_value &= \mathrm{Softmax}\left(\mathbf{W}_{\mathrm{fc}}\,\mathbf{l}_{\mathrm{fc}} + \mathbf{b}_{\mathrm{fc}}\right) \end{aligned} \tag{10} $$

where ReLU is the activation function used in Eq. (8), and W′fc and b′fc are the weight and bias parameters of the first fully connected layer, analogous to Wfc and bfc in the second layer.

We apply a softmax after the two fully connected layers to obtain the choice probability distribution. The final output of the combined framework is the value of Act_DQN(⋅), i.e., the action-value distribution over “Choose” and “Skip.”
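To make Eqs. (9) and (10) concrete, the following sketch performs a single LSTM step and the two fully connected layers directly from the formulas with NumPy; the dimensions and random weights are placeholders only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(r_cf, h_prev, c_prev, W, b):
    """One LSTM step of Eq. (9); peephole terms (W['ci'], W['cf'], W['co']) use elementwise products."""
    i = sigmoid(W["xi"] @ r_cf + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])  # input gate
    f = sigmoid(W["xf"] @ r_cf + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])  # forget gate
    c = f * c_prev + i * np.tanh(W["xc"] @ r_cf + W["hc"] @ h_prev + b["c"])    # cell state
    o = sigmoid(W["xo"] @ r_cf + W["ho"] @ h_prev + W["co"] * c + b["o"])       # output gate
    return o * np.tanh(c), c                                                    # final state, cell

def action_values(h, W1, b1, W2, b2):
    """Eq. (10): two fully connected layers with ReLU, then a softmax over {choose, skip}."""
    l_fc = np.maximum(0.0, W1 @ h + b1)
    logits = W2 @ l_fc + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy dimensions: convolution feature of size 8, hidden state of size 4
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((4, 8 if k[0] == "x" else 4))
     for k in ["xi", "hi", "xf", "hf", "xc", "hc", "xo", "ho"]}
W.update({k: rng.standard_normal(4) for k in ["ci", "cf", "co"]})
b = {k: np.zeros(4) for k in ["i", "f", "c", "o"]}
h, c = lstm_step(rng.standard_normal(8), np.zeros(4), np.zeros(4), W, b)
print(action_values(h, rng.standard_normal((3, 4)), np.zeros(3),
                    rng.standard_normal((2, 3)), np.zeros(2)))  # P(choose), P(skip)
```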

5 Experiments and analysis

We conducted experiments to evaluate the proposed MDPMS method on a real-world dataset from Sina Weibo. We selected four security topics as the specific search topics. The performances of state-of-the-art information search methods based on traditional techniques and deep neural networks are evaluated for searching security topics.

5.1 Dataset

To search security topics in microblog content, we collected a dataset from Sina Weibo covering June 10, 2012, to September 7, 2016, containing 385,712 posts that included both relevant and non-relevant content regarding the four selected security topics: Kunming terrorist attacks, Tianjin explosions, rainstorms in Hubei and fake vaccines in China. The proposed MDPMS method is trained through supervised learning with labeled data. The data of the security-related topics, including the noise, are randomly divided, with 70% used for training (including testing) and 30% used for validation. The statistics for the relevant contents, as positive samples of these four events, are shown in Table 1.

Table 1 Statistics of four events of microblogs

We split the contents of the four security topics into a training set and a validation set. The purpose is to ensure that the datasets include different specific topics that can reflect the commonality of security topics. The model is trained to evaluate common local semantic feature representations for the target topics, and the training set and the validation set both contain instances of all four specific security topic contents. The training procedure is conducted under supervised learning with labeled data by click-through.

The procedure for labeling the ranked positions of the dataset is as follows. First, the traditional query-likelihood language model [42] is used to simulate the general query process of a user to select positive samples from the dataset. The simulation is also used to acquire the target contents and intentions of users under the hypothetical situation. The labeled data are further refined in this simulated situation, in which the action values represent the user-perceived utility. Then, we repeat the process 150 times to obtain the ranked positions of the contents in the result list. Among the results, the position labels of contents whose ranked positions do not change are fixed. The remaining position labels are adjusted manually to obtain appropriate labels.

Following the click-through approach [44] of information retrieval, we labeled the dataset manually based on the results of queries using multiple keywords representative of the corresponding security topics. We also created sublabels for the dataset at different relevance levels in conjunction with the semimanual labels to improve the convenience of further evaluations. The relevance of these labels is determined during the training phase. Further evaluations are made using NDCG and MAP.

5.2 Experimental settings

MDPMS is based on an off-policy reinforcement learning algorithm. We initialized the algorithm by assigning parameters including the greed coefficient ε, the greed increment coefficient ic, the discount coefficient for reward γ and the parameter learning rate [29] η. The initial parameter values are shown in Table 2.

Table 2 Initializations of parameters

As shown in Table 2, the greed coefficient ε controls how the algorithm selects actions based on experience. Initially, the algorithm needs to explore every possible action for different contents to build experience when it has no prior experience to rely on. This is why the greed coefficient is initially set to ε = 0.01, ensuring that the algorithm is not greedy at initialization. The algorithm gradually learns how to select suitable actions for different microblog contents to gain higher rewards. This incremental greed process is controlled by the greed increment coefficient, and the algorithm becomes greedier as the learning process progresses. The greed coefficient is updated by ε = ε × ic at each time step. The greed coefficient ε is a float value ranging from 0.01 to 0.9; the 0.9 limit ensures that the algorithm never becomes totally greedy. The reward is updated when a newly chosen content is selected and placed in the result list. The changing reward makes the algorithm more intelligent, helping it to know which action should be chosen when faced with different contents. However, the algorithm selects an action for the upcoming content under a greedy strategy: eventually, there is only a 10% chance that it will explore new possibilities for similar content to update its older experiences. This situation reflects the fact that its initial experiences are more valuable. The discount coefficient for the reward is intended to balance the changing experience referential value shown in Eq. (6) and in Line 16 of Algorithm 1. The learning rate is a traditional concept in machine learning that controls the neural network parameter updating.
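A small sketch of the greed-coefficient schedule just described (initial ε = 0.01, multiplicative increment, capped at 0.9). The increment value used here is an illustrative placeholder, since Table 2 is not reproduced.

```python
import random

def epsilon_greedy_action(q_values, epsilon: float) -> int:
    """With probability epsilon act greedily on the learned action values; otherwise explore."""
    if random.random() < epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit experience
    return random.randrange(len(q_values))                           # explore

epsilon, ic, eps_max = 0.01, 1.001, 0.9   # initial greed, increment, upper limit
for step in range(5000):
    # ... pick an action with epsilon_greedy_action(...), observe the reward, update the DQN ...
    epsilon = min(eps_max, epsilon * ic)  # become greedier over time, never fully greedy
```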

The algorithm is reinforced in accordance with the evaluations of action values and rewards. During the learning stage, the rewards are the feedback from the phased results of the different episodes, from which the algorithm adjusts its action-selection strategy for various microblog contents to obtain a better future reward. In each time step, an action value is generated to evaluate the action selected for the corresponding microblog content; this is the algorithm's interface for analyzing semantic features, and the analysis is performed by the DQN framework. The pretrained Word2vec generates a 60-dimensional vector for each word. The fixed-size microblog vector is a multidimensional vector composed of word vectors; its first dimension depends on the number of word vectors after stop-word removal. We define the microblog vector with a size of 60 × 100 (wd = 60, p = 100 in Figure 2), where 100 is larger than the word count of the longest microblog entry. Another key component is the LSTM, which receives sequences of convolution feature representations and calculates action values.

5.3 Parameter sensitivity experiments

We conducted experiments to verify the effectiveness of the proposed method. The search goal is to select related microblog entries by relying on the actions selected for different entries. As presented in Section 3, the actions are defined as “choose” and “skip.” The algorithm selects the appropriate action for different entries to determine which microblog contents should be included in the search results; thus, the actions selected for specific entries directly determine the search results. We randomly selected 2000 microblog entries from the dataset to evaluate the changing process of choosing the actions. The 2000 posts were divided into 200 mini-batch subsets, each containing 10 posts, to verify the average action values at different training phases under the greedy mechanism; each content subset was unique. To present the results intuitively, we redefined the action value as a binary value (1 or −1), where the “choose” action value is 1 and the “skip” action value is −1. The average action-value curves during training are shown in Figure 3.

Figure 3 Average action values at different training phases

The subsets are input in sequence; the action-selection trend for the different subsets over a given set of time steps is shown in Figure 3. The reinforcement is reflected in the procedure used to gain action-selection experience for selecting different microblog posts as the search results. We use subsets of the selected samples to show the changes in the average action values for the different subsets evaluated in sequence. As shown in Figure 3, we selected four training phases to demonstrate how the average action values change for the different subsets. The action-value distributions of some subsets in the four training phases fall into the positive interval, meaning that overall, the algorithm selected “choose” more often than “skip.” Furthermore, the proportion of “choose” actions begins an upward trend between 1000 and 1500 training iterations for the 2000 content items. In contrast, a significant decline occurs after 2000 training iterations. This trend shows that the algorithm becomes greedy and stops exploring different possible actions on the same content. The fact that the algorithm becomes greedy means that the greed coefficient has converged to a fixed value. However, the algorithm parameters tend to converge at approximately 2500 training iterations. As presented in (a) to (c), the changes in the fluctuation center become stronger. After 2500 training iterations, the change in the fluctuation center becomes similar to that after 2000 training iterations, showing that the training tends to converge.

We also evaluated the efficiency of the action selections based on the randomly selected contents, as shown in Figure 4. The four images in Figure 4 present changes in the effective action ratios based on the input of the different subset sequences. Effective actions are those that choose, rather than skip, the correct contents for the corresponding ranking positions. Intuitively, the more effective actions there are, the higher the user-perceived search utility. A series of effective actions makes up an appropriate result list. Figure 4 shows the effectiveness of the proposed method, where we calculated the ratio of effective actions among all the actions the algorithm made when constructing the search result list in different training phases. Over the time steps of the training process, the actions selected by the algorithm for different subsets of microblog contents tend to become reasonable, and accordingly, the effective action ratio (the ratio of appropriate actions) changes as well.

Figure 4 Effective action ratios at different training phases

Figure 4 shows the changing process of the effective actions ratio for different subsets input as a sequence. In Figure 4 (a) and (b), the curves fluctuate mainly within a range of 0.6–0.8. In contrast, the curves in Figure 4 (c) and (d) fluctuate primarily within a range of 0.7–0.9. There is an approximate 0.1-unit increase between 1000 and 2500 training iterations, while a stable trend appears from 2000 to 2500 training iterations. The stable curve trend confirms that the algorithm tends to converge after approximately 2500 training iterations. The increasing trend of the effective actions ratio demonstrates that the algorithm updates its parameters effectively to take appropriate actions for corresponding microblog entries.

As shown in Figures 3 (c), (d) and 4 (c), (d), the algorithm “skips” more content from 2000 to 2500 training iterations as the effective actions ratio stabilizes. This situation indicates that the algorithm has learned relatively suitable parameters for selecting the appropriate actions for corresponding content. The phenomenon presented in Figures 3 and 4 shows that the algorithm is sensitive to the semantic features of the security topics when selecting the most related contents as the search results.

We computed the loss values to verify the learning efficiency of the reinforcement process, as shown in Figure 5. Unlike with traditional learning methods, an increasing phase occurs initially. Under the greedy mechanism, the reinforcement learning algorithm gains action-selection experience by exploring all the actions for different content items under the control of the greed coefficient. The loss curve is consistent with the changing trends of the average action values and effective action ratios at different training phases. The loss begins to decrease at approximately 700 iterations and converges after 2000 iterations.

Figure 5 Loss values at different training phases

In the next sections, evaluations on the ranked search results lists are presented to verify the effectiveness of the proposed method.

5.4 Microblog searching results and analysis

The proposed MDPMS method was trained under supervised learning with numerous labeled microblog contents. In the experiments, we selected 100 queries from the four specific security topics (25 from each) to verify the search efficiency of the proposed method. The search experiments were conducted on the validation dataset combining the four topics to verify the universality of the method for identifying general security-topic content from microblogs. Different queries result in different ranked search result lists. The evaluations are based on the average values of the evaluation metrics calculated from the search results of the different queries.

We adopt Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP) [57] as metrics to evaluate the search results for the top n ranking (e.g., NDCG@n and MAP@n). NDCG is calculated as shown in Eq. (5).

The MAP is calculated as shown in Eqs. (11) and (12).

$$ AveP = \frac{1}{n}\sum_{q}\left(P@n \times r\right) \tag{11} $$

$$ MAP = \frac{\sum_{q=1}^{|Q|} AveP(q)}{|Q|} \tag{12} $$

where r is the relevance score assigned to the content at position n with respect to a given query q.
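For reference, a standard computation of AveP@n and MAP@n consistent with Eqs. (11) and (12); the binary-relevance reading of r is our assumption.

```python
from typing import List, Sequence

def average_precision_at_n(relevances: Sequence[int], n: int) -> float:
    """AveP@n for one query: relevances is a 0/1 list over the ranked results (Eq. 11)."""
    hits, score = 0, 0.0
    for i, rel in enumerate(relevances[:n], start=1):
        if rel:
            hits += 1
            score += hits / i  # Precision@i at each relevant position
    return score / max(hits, 1)

def map_at_n(all_relevances: List[Sequence[int]], n: int) -> float:
    """MAP@n over a query set Q (Eq. 12)."""
    return sum(average_precision_at_n(r, n) for r in all_relevances) / len(all_relevances)

print(round(map_at_n([[1, 0, 1, 1, 0], [0, 1, 1, 0, 0]], n=5), 3))
```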

We also conducted comparative experiments to verify the efficiency of the proposed method. We compared MDPMS with state-of-the-art methods for searching information, including BM25, Aho–Corasick, DSSM, CLSM, RankNet, ListNet and MDPRank.

BM25 [40] uses a set of functions based on the bag-of-words model that ranks content based on query terms.

Aho–Corasick [10] is a type of dictionary-matching algorithm that locates elements of a finite set of strings.

RankNet [6] is a neural network learning-to-rank model for the underlying ranking trained by a probabilistic loss function using gradient descent.

ListNet [9] is a listwise learning-to-rank model for information search based on permutation probability and top n probability.

DSSM [23] is a deep neural network model that represents text strings in a latent semantic space and calculates the semantic similarities between queries and content.

CLSM [44] is a latent semantic model that incorporates convolution network structures over word vectors to find similarities between search queries and content.

MDPRank [21] is a learning-to-rank model for information search on the basis of MDP.

As listed in Table 3, the performances of the proposed MDPMS method and the baseline methods are evaluated using the NDCG@5, NDCG@10, NDCG@15 and NDCG@20 metrics on their average values. NDCG is applied to define the reward following the Markov decision process, which is the key factor in the method’s performance. The average NDCG@n evaluates different levels of relevance degrees for the search results. We output the rewards as the four metrics for the top 5, top 10, top 15 and top 20 search results of the validation experiments and calculated the four values for the baseline methods in comparative experiments using the same validation set.

Table 3 Comparison of MDPMS and baseline methods on average NDCG@n

Table 3 shows a comparison of the search result metrics for the top 5, top 10, top 15 and top 20 results. Our method outperformed all the selected baseline methods. MDPMS is trained to achieve a high NDCG value during the training phase, where the actions selected for different microblog contents guide it to construct an appropriate search result to gain a high reward. Because the reward is defined by integrating NDCG, the validation goal of MDPMS matches its training objective on NDCG. In contrast, the baseline methods produce search results from a static procedure based on similarity functions or a pretrained model. In particular, the BM25 model is a representative information search method in the traditional web search domain based on Term Frequency-Inverse Document Frequency (TF-IDF); however, this property is not well matched with the characteristics of microblogs, which contain casual expressions and arbitrary writing.

We also used MAP@n to evaluate the average Precision@n values of relevant content in the result lists. These values are shown in Table 4.

Table 4 Comparison of MDPMS and baseline methods on average MAP@n

As shown in Table 4, MDPMS outperforms the selected baseline methods for all the average values (MAP@5, MAP@10, MAP@15 and MAP@20). The definition of the MAP calculation in Eqs. (11) and (12) demonstrates a property of MAP: the higher the rank of the relevant content, the higher the MAP value. The average NDCG@5, Precision@5 and MAP@5 values of MDPMS indicate that the proposed method has advantages in relevance ranking for security topic content searches in microblogs. The overall evaluations demonstrate that the proposed MDPMS method performs better than the selected baseline methods for security topic content searches in microblogs. The baseline methods stem from traditional web search methods, learning-to-rank methods and deep learning methods. BM25 is the representative web search method based on TF-IDF, which does not adapt well to the casual expressions and arbitrary writing common in microblogs, especially when performing a content search for a specific topic. The Aho–Corasick method is a string-matching algorithm that can be used to perform content searches. The experimental results show that the traditional search methods based on TF-IDF or string matching perform worse than the learning-based methods. DSSM and CLSM learn semantic features based on deep neural networks to search target content by the similarities calculated from latent semantic spaces. However, these two deep learning search methods model the similarities between queries and contents based on global semantic feature representations, which are not robust to the semantic noise produced by the casual expressions and arbitrary writing in microblogs. RankNet and ListNet are two learning-to-rank approaches based on pairwise and listwise learning, respectively. They are also used by search engines in practice. RankNet is approximated by a classification problem that relies on labeled training data. ListNet is a listwise learning-to-rank approach that tries to directly optimize the values of evaluation measures. However, it is difficult for this approach to perform such optimizations without approximations or bounds because most of the evaluation measures are not continuous functions. Furthermore, another information search method based on the Markov decision process, MDPRank, is also tested as a baseline. As shown in Tables 3 and 4, the performance of MDPRank on NDCG and MAP is close to that of MDPMS. However, MDPRank adopts the policy gradient algorithm, an on-policy strategy in reinforcement learning; treating microblog contents as the actions to be chosen makes the policy space more complicated for gradient calculations than in off-policy methods for information search in social networks.

We also conducted the search experiment on another query set with more query items (2000). The NDCG and MAP evaluations are computed following the definitions in Eq. (5), Eq. (11), and Eq. (12). The average NDCG and MAP scores for the top 5, top 10, top 15 and top 20 results are presented in Tables 5 and 6.

Table 5 Comparison of MDPMS and baseline methods on average NDCG@n of 2000 queries
Table 6 Comparison of MDPMS and baseline methods on average MAP@n of 2000 queries

As shown in Tables 5 and 6, the MDPMS NDCG evaluation values on 2000 queries are lower than those on 100 queries by 0.02 to 0.07, while its MAP evaluation values are reduced by 0.02 to 0.19. For the baseline methods, the evaluation values on NDCG and MAP are reduced to varying degrees. However, the numerical distributions of Tables 5 and 6 are consistent with those of Tables 3 and 4. As shown in Tables 5 and 6, RankNet performs better than the other selected methods for Top 15 in NDCG and Top 15 and Top 20 in MAP. We evaluate the ranking list in steps of five documents. RankNet shows better MAP performance at Top 15 and Top 20, which illustrates that RankNet returned more accurate results for the query, and these results are ranked at the front positions. However, according to the NDCG evaluations, the proposed method returns more related contents overall.

The key factors of microblog search on specific topics include effective strategies for matching, ranking and analyzing the semantic features of microblog contents. In contrast to traditional web search problems, searching microblog content for specific topics requires intelligent search strategies and different content features based on user-perceived search utility. Another difference involves the semantic feature analysis processes needed to match content appropriately based on the queries.

The proposed method defines microblog search as an MDP. Under this definition, the microblog search is constructed as a process of choosing and ranking results dynamically. NDCG is applied to define the reward that forms the reinforcement during the training phase. The reward process is modeled as a part of the MDP state. Under this definition of the reward, measures are evaluated dynamically based on action-dependent rewards for different microblog contents. Content analysis of semantic features on specific topics is an indispensable component of searching and ranking content. Because of the casual expression and arbitrary writing characteristics of microblog content, a deep Q network is designed to analyze the semantic features implied by words representative of a specific topic based on deep Q learning.

We take the top 5 results on the topic of “Tianjin explosions” from microblogs searched by the proposed method MDPMS as an example. The query and these top 5 results are presented in Table 7.

Table 7 The top 5 search results on the topic of “Tianjin explosions”

As shown in Table 7, the search results reflect the semantics expressed by the query content. The experimental results show the effectiveness of MDPMS. Some other aspects are also worth discussing. In microblog contents posted by users, some subtopics of the main topic inevitably exist; the second and the fifth results show this phenomenon. The subtopics include a series of discussions caused by the main topic. However, these results are similar to the query topic; therefore, they are found by the content search for the security topic in microblog entries.

5.5 Cross-validation

We conducted a k-fold cross-validation [27] experiment to verify the effectiveness of the proposed method from a machine learning perspective. We adopted k = 5 to randomly partition the original dataset into 5 subdatasets of equal size. There is no overlap between these 5 subdatasets. In accordance with k-fold cross-validation, there are 5 validations (each subdataset is retained as the validation subdataset once), and the other 4 subdatasets are used as the training set. The prediction performances resulting from this cross-validation are presented in Figure 6, and the average precision values are shown in Table 8. As comparison methods for this cross-validation, we selected the existing learning-based methods mentioned above, including DSSM, CLSM, RankNet and ListNet.
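A minimal sketch of the 5-fold partitioning described above, using scikit-learn's KFold; the train_and_evaluate callable stands in for training MDPMS (or a baseline) and computing precision on the held-out fold.

```python
import numpy as np
from sklearn.model_selection import KFold

def run_cross_validation(contents, labels, train_and_evaluate, k: int = 5, seed: int = 42):
    """Split the dataset into k non-overlapping folds; each fold is held out once for validation."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    precisions = []
    for fold, (train_idx, val_idx) in enumerate(kf.split(contents), start=1):
        p = train_and_evaluate(contents[train_idx], labels[train_idx],
                               contents[val_idx], labels[val_idx])
        precisions.append(p)
        print(f"fold {fold}: precision = {p:.3f}")
    return float(np.mean(precisions))
```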

Figure 6 The precision of the 5-fold cross-validation

Table 8 The average precision values from the 5-fold cross-validation

As shown in Figure 6 and Table 8, the proposed MDPMS method outperforms the other selected learning-based methods. RankNet and ListNet focus on ranking from the pairwise and listwise aspects, respectively, to solve the search problem. These two methods performed comparatively better when searching for related security topic content in microblogs during the second, third and fourth validations. DSSM and CLSM solve the problem from the aspect of semantic matching based on maximizing the click probability of the relevant documents. These methods are unable to meet the requirements of microblog searching because the semantic features of microblogs differ from those of web search. The proposed MDPMS method adopts the Markov decision process of reinforcement learning to search security-topic-related content for an appropriate user-perceived utility. Its performance during the cross-validation shows its effectiveness for security-related content search tasks in microblogs.

6 Conclusions

In this paper, we proposed a method based on deep reinforcement learning (MDPMS) for searching for specific topics in microblogs. The method models the microblog search for specific topics as an MDP. A novel deep Q network combining a CNN and an LSTM was designed to analyze the local semantic features of the target topics, which are used to select appropriate actions (choose or skip) for the microblog entries to assemble the search results. The method evaluates content relevance dynamically through reinforcement learning instead of ranking based solely on similarities. Following the MDP reinforcement learning process, a reward based on NDCG was defined to model user-perceived search utility. In contrast to traditional web search methods, the proposed method focuses on intelligent strategies for searching and evaluating content relevance in accordance with typical microblog data features. The results of experiments based on real-world data showed that the proposed method outperformed the baseline methods. The results also verified that intelligent search strategies and evaluations of content relevance are important for performing microblog searches on specific topics.