
1 Introduction

This volume illustrates how quantum-like models can be exploited in Information Retrieval (IR) and other decision making processes. IR is a special and important instance of decision making because, when searching for information, the users of a retrieval system express their information needs through behavior (e.g., click-through activity) or queries (e.g., natural language phrases), whereas a computer system decides about the relevance of documents to the user’s information need. IR is inherently an interactive activity, performed by a user who accesses the collections managed by a system through highly interactive devices. These devices are immersed in a highly dynamic context where not only do the user’s queries rapidly evolve, but the collections of documents, such as news or magazine articles, also use words with different meanings. The main link between the “quantumness” of these models and IR is established by vector spaces, which have long been used to design modern computerized systems such as search engines and are currently the foundation of the most advanced methods for searching multimedia information.

Whatever the mathematical model or the retrieval function, documents and queries are mathematically represented as elements of sets, while the sets are labeled by words or other document properties. Queries, which are the most common way of expressing information needs, are sets or sequences of words, or sentences expressed in natural language; queries are often very short (e.g., one word) and occasionally much longer (e.g., a text paragraph). It is a matter of fact that the Boolean models for IR by definition view words as sets of documents and answer search queries with document sets obtained by set operators; moreover, the probabilistic models are all inspired by the Kolmogorov theory of probability, which is related to Boole’s theory of sets; in addition, the traditional retrieval models based on vector spaces are ultimately a means to provide a ranking or a measure over sets, because they assign a weight to words and then to the documents in the sets labeled by the occurring words. The implementation of content representation in terms of keywords and posting lists reflects the view of words as sets of documents and of retrieval operations as set operators. In this chapter, we will explain that a document collection can be searched by vectors embedding different words together, instead of by distinct words, by using the logic of vector spaces instead of the logic of sets.

Representing words is fundamental for tasks which involve sentences and documents. Word embedding is a family of techniques that has recently gained a great deal of attention and aims at learning vector representations of words that can be used in these tasks. Generally speaking, embedding consists in adopting a mapping in which a fixed-length vector is typically used to encode and represent an entity, e.g., a word, a document, or a graph. Technically, in order to embed an object X in another object Y, the embedding is an injective and structure-preserving map f : X → Y; examples include user/item embedding [6] in item recommendation, network embedding [23], feature embedding in manifold learning [89], and word embedding. In this chapter, we will focus on word embedding techniques, which embed words in a low-dimensional vector space.

Word embedding is driven by the Distributional Hypothesis [33, 38], which assumes that linguistic items which occur in similar contexts should have similar meanings. Methods for modeling the distributional hypothesis can be mainly divided into the following categories:

  • Vector-space models in Information Retrieval, e.g., [121], or representation in Semantic Spaces [67]

  • Cluster-based distributional representation [17, 63, 79]

  • Dimensionality reduction (matrix factorization) for document-word/word-word/word-context co-occurring matrix, also known as Latent Semantic Analysis (LSA) [24]

  • Prediction based word embedding, e.g., using neural network-based approaches.

LSA was proposed to extract descriptors that capture word and document relationships within one single model [24]. In practice, LSA is an application of Singular Value Decomposition (SVD) to a document-term matrix. Following LSA, Latent Dirichlet Allocation (LDA) aims at automatically discovering the main topics in a document corpus. Each document in the corpus is modeled as a probability distribution over a shared set of topics; these topics are in turn probability distributions over words, and each word in a document is generated by one of the topics [12]. This chapter focuses on the geometry provided by vector spaces, yet it is also linked to topic models, since a probability distribution over documents or features can be defined in a vector space, the latter being a core concept of the quantum mechanical framework applied to IR [68, 69, 110].

With the growth of computing power and the availability of large labeled datasets, neural network-based methods have become more and more dominant in fields such as Computer Vision (CV) and Natural Language Processing (NLP). In the NLP field, neural network-based word embedding was first investigated by Bengio et al. [7] and further developed in [21, 75]. Word2vec [70]Footnote 1 adopts a more efficient way to train word embedding, removing the non-linear hidden layer and introducing tricks such as hierarchical softmax and negative sampling. In [70] the authors also discussed the additive compositional structure, which means that word meanings can be composed by adding their corresponding vectors, for example, king − man ≈ queen − woman, with both differences roughly capturing the notion of royalty versus gender. This capability of capturing relationships among words was further discussed in [35], where a theoretical justification was provided. More importantly, Mikolov et al. [70] published open-source, well-trained, general word vectors, which made word embedding easy to use in various tasks.
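As an aside, the analogy property discussed above can be checked directly with off-the-shelf tooling. The short sketch below, which is not part of the original work, assumes the gensim library and a locally available pretrained vector file (the file name is only an example).

```python
# A minimal sketch of the additive compositional structure discussed above,
# using gensim's KeyedVectors. The file name is an assumption: any pretrained
# word2vec/GloVe vectors in a compatible format would work.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman should land near "queen" if the analogy holds.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```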

In order to intuitively show the word vectors, some selected words (52 words about animals and 110 words about colors) are visualized in a 2-dimensional plane (as shown in Fig. 1), using one of the most popular sets of GloVe word vectors.Footnote 2 The position of each word is given by its reduced vector, obtained through a dimensionality reduction approach called t-SNE. The words are almost perfectly clustered into two groups, about colors and animals, respectively. For example, the word vectors of “rat” and “dog” are close to that of “cat,” which is intuitively consistent with the Distributional Hypothesis, since these pairs (“cat” and “rat,” or “cat” and “dog”) tend to co-occur with high frequency.

Fig. 1 The visualization of some selected words
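For readers who want to reproduce a plot similar to Fig. 1, the following sketch projects a handful of GloVe vectors to two dimensions with t-SNE. The file name, the word list, and the hyper-parameters are illustrative assumptions rather than the exact setup behind the figure.

```python
# A minimal sketch of producing a figure like Fig. 1: project a few GloVe
# vectors to 2-D with t-SNE and plot them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["cat", "dog", "rat", "horse", "red", "blue", "green", "yellow"]

# Load only the vectors we need from the GloVe text format (word followed by floats).
vectors = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if parts[0] in words:
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)

matrix = np.stack([vectors[w] for w in words])
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(matrix)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```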

Word embedding provides a more flexible and fine-grained way to capture the semantics of words, as well as to model the semantic composition of larger-granularity units, e.g., from words to sentences or documents [71]. Some applications of word embedding will be discussed in Sect. 3. Although word embedding techniques and related neural network approaches have been successfully used in different IR and NLP tasks, they have some limitations, e.g., the polysemy and out-of-vocabulary problems. These issues have motivated further research in word embedding; Sect. 4.2 will discuss some of the current trends that aim at addressing them. Moreover, we will discuss the link between word vector representations and state-of-the-art approaches to modeling thematic structures.

2 Background

2.1 Distributional Hypothesis

Word embedding is driven by the Distributional Hypothesis [38]. The core of distributional hypothesis states that linguistic items with similar distributions have similar meanings and hence words with similar distributions should have similar representations. The distributional property is usually induced from document or textual neighborhoods (like sliding windows).

Some of the methods relying on the Distributional Hypothesis, together with the basic idea underlying each, are reported below (a minimal training sketch follows the list):

  • Language model \(p(w_k \mid w_{k-t}, w_{k-t+1}, \dots, w_{k-1})\): predicts the current word using the previous words [7].

  • Sequential scoring \(p(w_{k-t}, w_{k-t+1}, \dots, w_k)\): predicts whether the given sentence is a legal one [21].

  • Skip-gram \(p(w_i \mid w_k)\) for \(w_i \in \{w_i : |k - i| < t\}\): predicts a co-occurring word for each word [70].

  • CBOW \(p(w_k \mid w_{k-t}, \dots, w_{k-1}, w_{k+1}, \dots, w_{k+t})\): predicts a target word from its context words (both previous and following ones) [70].

  • GloVe \(p(\#(w_i, w_j)_{\mathrm{window}} \mid w_i, w_j)\): predicts the co-occurrence count of a word pair [78].
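As a concrete illustration of the Skip-gram and CBOW objectives above, the following minimal sketch assumes the gensim library (version 4 API) and a toy corpus; it is not the original experimental setup.

```python
# A minimal sketch of training the Skip-gram and CBOW objectives listed above
# on a toy corpus; the corpus and hyper-parameters are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

# sg=1 -> Skip-gram: predict co-occurring words within the window (t = window).
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=1, min_count=1)

# sg=0 -> CBOW: predict the target word from the surrounding context words.
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=0, min_count=1)

print(skipgram.wv["cat"].shape)          # (50,)
print(cbow.wv.most_similar("cat", topn=2))
```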

2.2 A Brief History of Word Embedding

While the Distributional Hypothesis was proposed many decades ago, the techniques for training word embeddings with neural networks have a much shorter history of about one and a half decades [7], as mentioned in Sect. 1. Some typical ways to generate word vectors are introduced below.

NNLM

The Neural Network Language Model (NNLM) [7] primarily aims to build a language model; learning word embeddings is a by-product rather than the main target. However, this was the first work to learn word vectors within a neural network (Fig. 2).

Fig. 2 NNLM concatenates all the word vectors in a sentence and then predicts the next word. → refers to the information flow in the forward neural network, while the circle denotes the neurons in the network. |V| is the size of the word vocabulary [58]

C&W

The Collobert and Weston (C&W) approach was proposed in [21] in order to predict the fluency of a given sequence (see Fig. 3). One of the tasks in [21] casts language modeling as a simple binary classification task: “if the word in the middle of the input window is related to its context or not” [21].

Fig. 3 C&W concatenates all the word vectors to predict whether the input is a natural sentence or one whose center word has been replaced with a random word [58]

Skip-Gram

Skip-gram balances a trade-off between performance and simplicity. As shown in Fig. 4, Skip-gram uses a word to predict one of its neighboring words.

Fig. 4 Skip-gram directly uses one word to predict its neighboring word [58]

CBOW

As shown in Fig. 5, CBOW uses context words to predict the current word. The difference between Skip-gram and CBOW is that in order to predict the target word, CBOW uses many words as the context, while Skip-gram uses only one neighboring word.

Fig. 5 CBOW uses the average embedding of the contextual words to predict the target word, where the contextual words surround the target word [58]

GloVe

Another popular word embedding named GloVeFootnote 3 [78] combines the advantages of global matrix factorization and local context window methods. It is worth mentioning that [60] shows that Skip-gram with negative sampling has the same optimal solution as the factorization of a (shifted) Point-wise Mutual Information (PMI) matrix.
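The count-based side of this connection can be illustrated with a small sketch: building a positive PMI matrix from a toy corpus and factorizing it with truncated SVD to obtain dense word vectors. This is only an illustration of the general idea in [60], not its exact construction (which involves a shifted PMI matrix and negative sampling).

```python
# A small sketch: word-context PMI matrix from a toy corpus, factorized with
# truncated SVD to obtain dense word vectors. All names are illustrative.
import numpy as np

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Count word-context co-occurrences within the window.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[index[w], index[sent[j]]] += 1

total = counts.sum()
p_w = counts.sum(axis=1) / total
p_c = counts.sum(axis=0) / total
with np.errstate(divide="ignore"):
    pmi = np.log((counts / total) / np.outer(p_w, p_c))
ppmi = np.maximum(pmi, 0)          # positive PMI, a common choice

# Truncated SVD: keep the top-k singular vectors as word embeddings.
u, s, _ = np.linalg.svd(ppmi)
k = 2
word_vectors = u[:, :k] * s[:k]
print(dict(zip(vocab, np.round(word_vectors, 2))))
```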

3 Applications of Word Embedding

According to the input and output objects, we will discuss word-level applications in Sect. 3.1, sentence-level applications in Sect. 3.2, pair-level applications in Sect. 3.3, and seq2seq generation applications in Sect. 3.4. These applications can be the benchmarks to evaluate the quality of word embedding, as introduced in Sect. 3.5.

3.1 Word-Level Applications

Based on word vectors learned from a large-scale corpus, word-level properties can be inferred. At the single-word level, sentiment polarity is one of the typical properties. Word-pair properties are more common tasks, such as word similarity and word analogy.

An advantage of word embedding is that all words, even those organized in a complicated hierarchical structure like WordNet [31],Footnote 4 are each represented by a single vector, leading to a very simple data structure that is easy to plug into a downstream neural network. Meanwhile, this simple data structure, namely a word-vector mapping, also offers some potential to share knowledge across different domains.

3.2 Sentence-Level Application

Regarding sentence-level applications, the two typical tasks are sentence classification and sequential labeling, depending on how many labels the task needs. In text classification there is only one final label for the whole sentence, whereas in sequential labeling the number of labels is related to the number of tokens in the sentence (Fig. 6).

Fig. 6 Sentence-level applications: sentence classification and sequential labeling

Sentence Classification

Sentence classification aims to predict a label for a given sentence, where the label can be related to the topic, the sentiment polarity, or whether an e-mail is spam. Text classification was previously surveyed by Zhai [1], who mainly discussed traditional textual representations. Word embeddings trained on a large-scale external corpus (like Wikipedia pages or online news) are commonly used in IR and NLP tasks like text classification. Especially for a task with limited labeled data, in which it is impossible to train effective word vectors from scratch (usually around one hundred thousand parameters need to be trained) due to the limited corpus, pre-trained embeddings from a large-scale external corpus can provide general features. For example, the average embedding (possibly with a weighting scheme) can be a baseline for many sentence representations and even document representations. However, due to errors inherited from the embedding training process on the external corpus and the possible domain difference between the current dataset and the external corpus, adopting the embeddings as fixed features usually does not achieve a significant improvement over traditional bag-of-words models, e.g., BM25 [88].
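A minimal sketch of the average-embedding baseline mentioned above is given below: each sentence is represented as the mean of its word vectors and a linear classifier is trained on top. The tiny lookup table and labeled examples are illustrative assumptions.

```python
# Average-embedding baseline: mean of pretrained word vectors + linear classifier.
# "vectors" stands for any pretrained lookup; the labeled set is purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_embedding(tokens, vectors, dim):
    # Average the vectors of in-vocabulary words; OOV words are simply skipped.
    known = [vectors[w] for w in tokens if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# Toy pretrained lookup and toy labeled data (1 = positive, 0 = negative).
dim = 4
vectors = {"good": np.ones(dim), "great": np.ones(dim) * 0.9,
           "bad": -np.ones(dim), "awful": -np.ones(dim) * 0.9}
train = [(["good", "movie"], 1), (["great", "film"], 1),
         (["bad", "plot"], 0), (["awful", "acting"], 0)]

X = np.stack([sentence_embedding(t, vectors, dim) for t, _ in train])
y = [label for _, label in train]

clf = LogisticRegression().fit(X, y)
print(clf.predict([sentence_embedding(["good", "acting"], vectors, dim)]))
```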

In order to solve this problem, the word vectors trained on a large-scale external corpus are adopted only as initial values for the downstream task [51]. Generally speaking, all the parameters of a neural network are trained from scratch with a random or regularized initialization. However, the number of parameters in the network is large while the training samples may be few. Moreover, knowledge learned from another corpus is expected to be reused in a new task, a practice that is common in Computer Vision (CV) [41]. In the extreme situation where the current dataset is large enough to implicitly train the word embedding from scratch, the effect of the pre-trained initialization may be of little importance.

A first option is to apply a multi-layer perceptron over the embedding layer. Kim et al. [51] first proposed a CNN-based neural network for sentence classification, as shown in Fig. 7. The other typical neural networks, namely the Recurrent Neural Network (and its variant called Long Short-Term Memory (LSTM) network [43], shown in Fig. 8) and the Recursive Neural Network [36, 81], which naturally process sequential and tree-structured sentences respectively, are becoming more and more popular. In particular, word embedding with an LSTM encoder–decoder architecture [3, 18] outperformed classical statistical machine translation,Footnote 5 which had previously dominated machine translation approaches. Currently, industrial players like Google have adopted fully neural machine translation and abandoned statistical machine translation.Footnote 6
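The following compact sketch is written in the spirit of the CNN of Fig. 7 rather than as the exact model of [51]; it also shows how pretrained vectors can serve as the initial value of the embedding layer, as discussed above. All sizes and the random "pretrained" matrix are illustrative.

```python
# A compact CNN sketch for sentence classification: embedding layer initialized
# from pretrained vectors, parallel convolutions over word windows, max pooling,
# and a linear classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, pretrained, num_classes=2, kernel_sizes=(3, 4, 5), channels=16):
        super().__init__()
        # Initialize from pretrained vectors and fine-tune them (freeze=False).
        self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
        dim = pretrained.size(1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, channels, k) for k in kernel_sizes])
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.embedding(token_ids)             # (batch, seq_len, dim)
        x = x.transpose(1, 2)                     # (batch, dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # (batch, num_classes)

# Usage with a random "pretrained" matrix of 100 words x 50 dimensions.
pretrained = torch.randn(100, 50)
model = TextCNN(pretrained)
logits = model(torch.randint(0, 100, (8, 20)))   # batch of 8 sentences, 20 tokens
print(logits.shape)                              # torch.Size([8, 2])
```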

Fig. 7 CNN for sentence modeling [52] with convolution structures and max pooling

Fig. 8 LSTM. The left subfigure shows a recurrent structure, while the right one is unfolded over time

Sequential Labeling

Sequence labeling aims to classify each item in a sequence of observed values, taking the whole context into account. For example, Part-Of-Speech (POS) tagging, also called word-category disambiguation, is the process of assigning to each word in a text (corpus) a particular part-of-speech label (e.g., noun or verb) based on its context, i.e., its relationship with adjacent and related words in a phrase or sentence. Similarly to POS tagging, segmentation tasks like Named Entity Recognition (NER) and word segmentation can also be cast as general sequential labeling tasks, by defining labels such as a begin label (usually named “B”), an inside label (usually named “I”), an end label (usually named “E”), and an outside label (usually named “O”), as illustrated after this paragraph. The typical architecture for sequence labeling is called BiLSTM-CRF [46, 59], which is based on bidirectional LSTMs and conditional random fields, as shown in Fig. 9.
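The sketch below illustrates, with a hypothetical helper function, how the begin/inside/end/outside labels mentioned above turn a sequential labeling output into entity spans; it is not a fixed standard API.

```python
# A small sketch of how B/I/E/O labels turn a sequential labeling output into
# spans (here for NER). The tags and the helper function are illustrative.
def extract_spans(tokens, tags):
    """Collect (entity_text, start, end) spans from a B/I/E/O tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                            # an entity begins here
            start = i
        elif tag == "E" and start is not None:    # the entity ends here
            spans.append((" ".join(tokens[start:i + 1]), start, i))
            start = None
        elif tag == "O":                          # outside any entity
            start = None
    return spans

tokens = ["Barack", "Obama", "visited", "New", "York"]
tags   = ["B",      "E",     "O",       "B",   "E"]
print(extract_spans(tokens, tags))
# [('Barack Obama', 0, 1), ('New York', 3, 4)]
```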

Fig. 9 LSTM-CRF for named entity recognition [59]

Document-Level Representation

A document typically consists of multiple sentences. If we interpret a document as a long “sentence,” we can reuse the approaches proposed for sentence-level applications, while adapting them to the larger number of tokens. For example, a hierarchical architecture is usually adopted for document representation, especially with RNNs, as shown in Fig. 10. Generally speaking, all the sentence-level approaches can be used for document-level representation, especially if the document is not too long.

Fig. 10 Hierarchical recurrent neural network [64]

3.3 Sentence-Pair Level Application

The difference between sentence applications and sentence-pair applications is the extra interaction module (we call it a matching module), as shown in Fig. 11. Evaluating the relationship between two sentences (or a sentence pair) is typically considered a matching task, e.g., information retrieval [73, 74, 129], natural language inference [14], paraphrase identification [27], and question answering. It is worth mentioning that the Reading Comprehension (RC) task can also be viewed as a matching task (close to question answering) that uses an extra context, i.e., a passage providing background knowledge, whereas question answering (answer selection) has no specific context. In the next subsection, we will introduce the Question Answering task and the Reading Comprehension task.

Fig. 11 The main difference between a sentence-pair task and a sentence-based task is that there is one extra interaction module for the matching task

Question Answering

Differently from expert systems operating over structured knowledge, question answering in IR is more about retrieval and ranking over a limited set of unstructured candidate documents. In some literature, reading comprehension is also considered a question answering task, as in SQuAD QA. Generally speaking, reading comprehension is question answering within a specific context, such as a long document in which some internal phrases or sentences serve as answers, as shown in Fig. 12. Table 1 reports currently popular QA datasets.

Table 1 Popular QA datasets

Fig. 12 A demo of the SQuAD dataset [85]

In order to compare neural matching models and non-neural models, we focus on TREC QA (answer selection), which has a limited set of answer candidates, rather than an unstructured document serving as context as in reading comprehension. Some matching methods are shown in Table 2, which mainly refers to the ACL wiki page.Footnote 7

Table 2 State-of-the-art methods for sentence selection, where the evaluation relies on the TREC QA dataset

3.4 Seq2seq Application

Seq2seq denotes the family of tasks in which both the input and the output are sequences, as in machine translation. It mainly uses an encoder–decoder architecture [19, 100], possibly with attention mechanisms [3], as shown in Fig. 13. Both the encoder and the decoder can be implemented as an RNN [19], a CNN [34], or with attention mechanisms only (i.e., the Transformer [111]).
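A minimal GRU-based encoder–decoder sketch (without attention) of this architecture is shown below; vocabulary sizes and dimensions are illustrative assumptions.

```python
# A minimal encoder-decoder sketch of the Seq2seq architecture described above.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sequence into a final hidden state (the "context").
        _, context = self.encoder(self.src_emb(src_ids))
        # Decode the target sequence conditioned on that context (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_out)                 # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (4, 12)), torch.randint(0, 1200, (4, 10)))
print(logits.shape)                              # torch.Size([4, 10, 1200])
```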

Fig. 13 An illustration of the Seq2seq (RNN Encoder–Decoder) architecture

3.5 Evaluation

The basic evaluations of word embedding techniques are based on the above applications [94], e.g., word-level evaluation and downstream NLP tasks like those mentioned in the last section, as shown in [58]. For a downstream task in particular, there are two common ways to use word embedding, namely as fixed features or as initial weights that are then fine-tuned. We divide the evaluations into two parts, i.e., context-free word properties and embedding-based downstream NLP tasks; the latter may involve the context, and the embedding can be fine-tuned.

Word Property

Examples of the context-free word properties include word polarity classification, word similarity, word analogy, and recognition of synonyms and antonyms. In particular, one of the typical tasks is the analogy task [70], which targets both syntactic and semantic analogies. For instance, “man is to woman” is semantically analogous to “king is to queen,” while “predict is to predicting” is syntactically analogous to “dance is to dancing.” Word embedding methods achieve good performance in the above word-level tasks, which demonstrates that word embeddings can capture basic semantic and syntactic properties of words.

Downstream Task

If word embedding is used in context, which means we consider each word within a phrase or sentence for a specific target, we can train the word embedding by using the labels of the specific task, e.g., sequential labeling, text classification, text matching, and machine translation. These tasks are distinguished by their input and output patterns, as shown in Table 3.

Table 3 The differences among the downstream tasks

Generally speaking, the tasks based on word properties can partially reflect the quality of a word embedding. However, the final performance on downstream tasks may still vary. It is therefore more reliable to assess an embedding directly on the real-world downstream tasks shown in Table 3.

4 Reconsidering Word Embedding

Some limitations and trends of word embedding are introduced in Sects. 4.1 and 4.2. We also try to discuss the connections between word embedding and topic models in Sect. 4.3. In Sect. 4.4, the dynamic properties of word embedding are discussed in detail.

4.1 Limitations

Limitation of Distributional Hypothesis

The first concern directly targets the effectiveness of the distributional hypothesis. Lucy and Gauthier [66] find that while word embeddings capture certain conceptual features such as “is edible” and “is a tool,” they do not tend to capture perceptual features such as “is chewy” and “is curved,” potentially because the latter are not easily inferred from distributional semantics alone.Footnote 8

Lack of Theoretical Explanation

Generally, humans perceive words along various aspects beyond pure semantics, e.g., sentiment polarity and semantic hierarchies such as WordNet. Thus, mapping a word to a real-valued vector is a practical but preliminary method, which offers limited cues for human understanding. For a given word vector, it is hard for humans to know what exactly the word means; the scalar value of each element in a word vector does not carry much physical meaning. Consequently, it is difficult to interpret the obtained vector space from a human point of view.

Polysemy Problem

Another problem with word embeddings is that they do not account for polysemy, instead assigning exactly one vector per surface form. Several methods have been proposed to address this issue. For example, Athiwaratkun and Wilson [2] represent words not by single vectors but by Gaussian probability distributions with multiple modes, thus capturing both uncertainty and polysemy. Upadhyay et al. [109] leverage multi-lingual parallel data to learn multi-sense word embeddings: for example, the English word “bank” can be translated into both the French words banc and banque (evidence that “bank” is polysemous), and this evidence helps distinguish its two meanings.

Out-Of-Vocabulary Problem

With a pre-trained word embedding, some words may not be found in the vocabulary of the pre-trained word vectors; this is the Out-Of-Vocabulary (OOV) problem. If there are many OOV words, the final performance degrades considerably, because only part of the weights are initialized from the given word vectors, while the remaining words are randomly initialized. This happens more frequently in specialized domains, like medical text analysis, since it is not easy to find domain-specific terms in a general corpus like Wikipedia.

Semantic Change Over Time

One of the limitations of most word embedding approaches is that they assume that the meaning of a word does not change over time. This assumption can be a limitation when considering corpora of historical texts or streams of text in newspapers or social media. Section 4.4 will discuss some recent works which aim to explicitly include the temporal dimension in order to capture how word meaning changes over time.

4.2 Trends

Interpretability

One of the definitions of “interpretability” is proposed by Lipton [65]. In particular, Lipton [65] identifies two broad approaches to interpretability: post-hoc explanations and transparency. Post-hoc explanations take a learned model and draw some useful insights from it; typically these insights provide only a partial or indirect explanation of how the model works. The typical examples are visualization (e.g., in machine translation [26]) and transfer learning.

Transparency asks more directly “how does the model work?” and seeks to provide some way to understand the core mechanisms of the model itself. As Manning said, “Both language understanding and artificial intelligence require being able to understand bigger things from knowing about smaller parts.”Footnote 9 Firstly, it is more reasonable to build a bottom-up system with linguistically structured representations, like syntactic or semantic parsers and sub-word structures (refer to Sect. 4.2), than an end-to-end system that does not consider any linguistic structure. Moreover, we can use constraints to normalize each subcomponent, making it understandable for humans as well as alleviating convergence problems. For instance, the attention mechanism [3] is one of the most successful mechanisms from the point of view of normalization. An unconstrained real-valued vector is hard to understand and interpret; after the addition of a softmax operation, the vector denotes a multinomial probability distribution in which each element ranges from 0 to 1 and the elements sum to 1.

Contextualized Word Embedding

Previously, word embedding was static, which means it did not depend on the context: it was a one-to-one mapping from a word to a fixed vector. For example, the word “bank” has at least two meanings, i.e., “the land alongside or sloping down to a river or lake” and “a financial establishment that invests money deposited by customers, pays it out when required, makes loans at interest, and exchanges currency.” However, occurrences of the word in a finance-related context and in a river-related context would be mapped to the same fixed vector, which is not reasonable for language. Instead of storing a static look-up table, contextualized word embedding learns a language model that generates a word vector for each word on the fly, based on its neighboring words (the context). The first such model was proposed under the name Embeddings from Language Models (ELMo) [80], and the direction was further investigated by Generative Pre-Training (GPT) [83] and BERT [25]. More specifically, BERT obtained new state-of-the-art results on eleven natural language processing tasks, including large improvements on the GLUE benchmark, MultiNLI accuracy, and SQuAD.
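The contextual behaviour described above can be observed directly with a pretrained model. The sketch below assumes the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint; it compares the vectors assigned to the same surface form “bank” in two different contexts.

```python
# A minimal sketch: the same surface form "bank" receives different vectors in
# a river context and a finance context when using a contextualized model.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]                    # vector of "bank"

v_river = bank_vector("he sat on the bank of the river")
v_money = bank_vector("she deposited cash at the bank")
print(torch.cosine_similarity(v_river, v_money, dim=0))    # noticeably below 1.0
```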

Linguistically Enhanced Word Embedding

One of the main criticisms of word embedding is that it ignores linguistic knowledge and instead adopts a brute-force approach that is totally driven by data. However, many linguistic resources for words already exist, e.g., WordNet and sentiment lexicons. Incorporating such linguistic knowledge into the current NLP paradigm can reduce the dependence on data. These linguistic resources are expected to enhance the representative power of word embedding; they may be injected at layers above the word embedding layer, e.g., syntactic structures [101], or directly into the word embedding itself using WordNet and related lexical resources [30, 61].

Sub-Word Embedding

We briefly discussed the OOV problem in Sect. 4.1. Previous solutions for relieving it were commonly based on empirical insights, e.g., assigning a special token to all the OOV words. In [132] character-based embedding was adopted for text classification, avoiding word-level embedding altogether. With sub-word embedding there is no OOV problem, since the word embedding is built directly from units of smaller granularity, whose number is limited. For example, one of the sub-word approaches for English is based on characters, which are limited to a–z, A–Z, 0–9, punctuation, and other special symbols. Moreover, a character-level approach could be beneficial for specific languages, like Chinese, that have units smaller than words (e.g., character components) which still carry abundant semantic information. Sub-word regularization [54] trains the model with multiple sub-word segmentations (based on a unigram language model) probabilistically sampled during training. These works demonstrate that there is some potential to incorporate fine-grained linguistic knowledge into the neural network [13, 54].
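In the spirit of fastText-style sub-word embeddings (and not tied to any specific library), the following sketch shows how an out-of-vocabulary word can still receive a vector by summing or averaging the vectors of its character n-grams; the n-gram table here is a toy assumption.

```python
# A small sketch of sub-word (character n-gram) embeddings for OOV handling.
import numpy as np

def char_ngrams(word, n=3):
    padded = f"<{word}>"                     # boundary markers, as in fastText
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

dim = 8
rng = np.random.default_rng(0)
ngram_vectors = {}                           # toy table, normally learned

def word_vector(word):
    # Average the vectors of all character n-grams of the word.
    grams = char_ngrams(word)
    for g in grams:
        ngram_vectors.setdefault(g, rng.normal(size=dim))
    return np.mean([ngram_vectors[g] for g in grams], axis=0)

# Even a word never seen at training time gets a representation this way.
print(char_ngrams("where"))                  # ['<wh', 'whe', 'her', 'ere', 're>']
print(word_vector("unseenword").shape)       # (8,)
```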

Advanced Word Embedding: Beyond One Fixed Real-Valued Vector

More recently, different types of word embedding beyond real-valued vectors have been developed, for example:

  • Gaussian embedding [112] assigns a Gaussian distribution to each word, instead of a point vector in a low-dimensional space. The advantages are that it naturally captures uncertainty and expresses asymmetries in the relationship between two words.

  • Hyperbolic embedding [77, 93] embeds words as points in a Cartesian product of hyperbolic spaces; therefore, the hyperbolic distance between two points becomes the Fisher distance between the corresponding probability distribution functions (PDFs). This additionally derives a novel principled “is-a” score on top of word embeddings that can be leveraged for hypernymy detection.

  • Meta embedding [50, 127] adopts multiple groups of word vectors and adaptively obtains a word vector by leveraging all the word embeddings.

  • Complex-valued embedding [62, 114] formulates a linguistic unit as a complex-valued vector, and links its length and direction to different physical meanings: the length represents the relative weight of the word, while the direction is viewed as a superposition state. The superposition state is further represented in an amplitude-phase manner, with amplitudes corresponding to the lexical meaning and phases implicitly reflecting the higher-level semantic aspects such as polarity, ambiguity, or emotion.

4.3 Linking Word Embedding to Vector-Space Based Approaches and Representation of Thematic Structures

Deriving the Topic Distribution from Word Embedding

Research on the representation of themes in an unstructured document corpus (finding word patterns in a document collection) dates back to the 1990s, i.e., to the introduction of LSA [24]. A subsequent extension that exploits a statistical model was proposed by Hofmann in [44]. That model, named Probabilistic Latent Semantic Indexing (PLSI), relies on the aspect model, a latent variable model for co-occurrence data where an occurrence (in our case a word occurrence) is associated with an unobserved/latent variable. The work by Hofmann and subsequent works rely on the “same fundamental idea—that a document is a mixture of topics—but make slightly different statistical assumptions” [99]. For instance, in [12] Blei et al. extended the work by Hofmann by making an assumption on how the mixture weights for the topics in a document are generated, introducing a Dirichlet prior. This line of research is known as topic modeling, where a topic is interpreted as a group of semantically related words. Since the focus of this chapter is not on topic modeling, in the remainder of this section we introduce only the basic concepts needed to discuss possible links with word embedding approaches; the reader can refer to [9, 11, 15, 99] for a more comprehensive discussion of the differences among the diverse topic models and of the research trends and directions in topic modeling.

As mentioned above, probabilistic topic models consider the document as a distribution over topics, while the topic is a distribution over words. In PLSI no prior distributions are adopted and the joint probability distribution between document and word is expressed as follows:

$$\displaystyle \begin{aligned} p(w,d)= \sum_{c \in C } p(w,d,c)= p(d) \sum_{c \in C} p(c,w |d) = p(d) \sum_{c \in C} p(c|d) p(w|c), \end{aligned} $$
(1)

where d is a document, w is a specific word, and C is the set of topics. A crucial point of topic models is how to estimate p(c|d) and p(w|c).

Using an “empirical” approach, we can also obtain p(c|d) and p(w|c) from word embedding. Suppose that we have a word embedding, i.e., a mapping \( \mathcal{N} \rightarrow \mathcal{R}^n \) from a word (denoted by a natural-number index) to a dense vector. For a given document d with word sequence \(\{w_1, w_2, \dots, w_n\}\), we can build a representation of d with a summed (or averaged) embedding as in [49], namely \(\boldsymbol{d} = \sum_{i=1}^n \boldsymbol{w_i}\). It is likewise easy to represent a topic with distribution p(w|c) as \(\boldsymbol{c_j} = \sum_{i=1}^{|V|} p(w_i|c_j) \boldsymbol{w_i}\), with \(\boldsymbol{c_j} \in C\). Then we can obtain the following topic distribution of a document:

$$\displaystyle \begin{aligned} p(c_j|d) = \frac{e^{-\vert \vert \boldsymbol{d} - \boldsymbol{c_j} \vert \vert_2 }}{\sum_i^{\vert C \vert} e^{-\vert \vert \boldsymbol{d} - \boldsymbol{c_i}\vert \vert_2 }}. \end{aligned} $$
(2)
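A small numeric sketch of Eq. (2) is given below, assuming that the document vector d and the topic vectors c_j have already been built from word embeddings as described above.

```python
# A numeric sketch of Eq. (2): a softmax over negative Euclidean distances
# between the document vector and the topic vectors.
import numpy as np

def topic_distribution(d, topic_vectors):
    # p(c_j | d) as in Eq. (2).
    distances = np.linalg.norm(topic_vectors - d, axis=1)
    scores = np.exp(-distances)
    return scores / scores.sum()

d = np.array([0.2, 0.1, 0.4])                    # toy document embedding
topics = np.array([[0.3, 0.1, 0.5],              # toy topic vectors c_1, c_2
                   [-0.5, 0.8, -0.2]])
print(topic_distribution(d, topics))             # sums to 1; the closer topic gets more mass
```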

The relationship between word embedding and topic models has been addressed in the literature. For instance, the work reported in [60] shows that a special case of word embedding, i.e., Skip-gram, has the same optimal solution as the factorization of a shifted Point-wise Mutual Information (PMI) matrix.Footnote 10 Empirically, the count-based representations and distributed representations can be combined together with complementary benefits [78, 115].

Recent works focused on exploiting both methods. The discussion of previous approaches in [98] identifies two lines of research: methods that improve word embedding through the adoption of topic models, which addresses the polysemy problem, and methods that improve topic models through word embedding, which obtain more coherent words among the top words associated with a topic. These approaches mainly rely on a pipeline strategy, “where either a standard word embedding is used to improve a topic model or a standard topic model is used to learn better word embeddings” [98]. The limitation of these approaches is their inability to exploit the mutual strengthening between the two, which a joint learning strategy could, in principle, exploit. This is the basic intuition underlying the work reported in [98]. Another example is lda2vec, whose basic idea was “modifying the Skip-gram Negative-Sampling objective in [71] to utilize document-wide feature vectors while simultaneously learning continuous document weights loading onto topic vectors.” The work reported in [117] proposes a different approach relying on a “topic-aware convolutional architecture” and a reinforcement learning algorithm in order to address the task of text summarization.

Regarding the Contextual Windows

The previous subsection suggests possible connections between word embedding and the representation of the thematic structure in document corpora, e.g., through topic models. Vector space based approaches in IR, topic models, matrix factorization, and word embedding can all be considered as approaches relying on the distributional hypothesis, as discussed in Sect. 1. One of the differences among these methods is how the size of the contextual window is chosen. In this chapter, we classify the contextual window into several sizes, i.e., “character → word → phrase/N-gram → clause → sentence → paragraph → document,” ordered from the smallest to the largest granularity. For example, the VSM in IR usually chooses the whole document as the context; thus, it may capture document-level features of the text, like the thematic structure. Approaches based on word-word matrix factorization usually set a smaller window size to statistically analyze the co-occurrence between words (similar to the windows of CBOW [70]), thus targeting a smaller context in order to capture word-level features related to word meaning.

Depending on the context size, the features of vector space based approaches in IR are already at a relatively high level, e.g., the TFIDF vector or the language model [130], and they can be used directly for relatively downstream tasks like document ranking. The lower-level word features of word-word matrix factorization (or CBOW) can be used directly for relatively upstream tasks like morphology, lexicon, and syntax, but they need some abstraction components to turn the low-level features into high-level ones. On the other hand, abstraction from low-level features to high-level features may imply the loss of some fundamental lexical meaning. The low-level features (word-word matrix factorization or CBOW) are usually considered a better basic input for another, “stronger” learning model (e.g., one using multiple layers of non-linear abstraction) compared to higher-level features.

4.4 Towards Dynamic Word Embedding

One of the limitations of most representations of words, documents, and themes is that they do not consider the temporal dimension. This is crucial when considering corpora such as historical document archives, newspapers, or social media, e.g., tweets, that consist of a continuous stream of informative resources. The use of these “time-stamped” resources is useful not only for general tasks but also for specialist users. Indeed, the specialists of a discipline need to form hypotheses from data, for example, by means of longitudinal studies. This is the case for specialists in the fields of Social Science, Humanities, Journalism, and Marketing.

Let us consider, for instance, the case of sociologists who study the public perception of science and technology (this line of research is known as STS, Science and Technology Studies). The study of how some science and technology-related issues are discussed by the media, e.g., newspapers, could be useful in providing policy makers with insights on the public perception of issues on which they should or intend to take action, or in providing guidance on the way these issues should be publicly discussed (e.g., on the use of “sensible” words or aspects related to the issues). In this context, relevant information can be gained from how the meaning of a word, or the perception of an issue related to a word, changes over time.

Previous works on topic modeling addressed the issue of including the temporal dimension, specifically, the fact that topics can change over time. In [72] Mimno proposes a possible approach to visualize topic coverage across time starting from topics learnt using a “static” approach: given the probabilities and the topic assignments estimated via LDA, the topic trend can be visualized by counting the number of words in each topic published in a given year and then normalizing over the total number of words for that year. Other works embedded the time dependence directly in the statistical model. One of the earliest works is that proposed in [10], where dynamic topic models were introduced. The underlying assumption is that time is divided into time slices, e.g., by years; documents in a specific time slice are modeled using a K-component topic model (K is the number of topics), where topics in a given time slice evolve from those in the previous time slice. This kind of representation could be extremely useful for a specialist in order to follow the evolution of a single word, e.g., by inspecting the top words of the diverse topics in which the word is framed in their research hypothesis (e.g., the word “nuclear” framed in “innovation,” “risk,” or “energy” topics), or by following the posterior estimate of the frequency of the word as a function of the year, as shown in [10]. As stated by the authors, one of the limitations of that approach is that the number of topics needs to be specified beforehand; the work reported in [28] aimed to address this limitation by introducing a non-parametric version for modeling topics over time.

Even if dynamic/time-aware versions of topic models can support specialists in their investigation, the adoption of word embedding to study changes in a word representation could provide complementary evidence to support or undermine a research hypothesis. Indeed, as mentioned above, topic models are learned from a more “global view,” while word embedding exploits a more “local view,” e.g., using evidence from local context windows; this local view might help to obtain a word representation that, in a way, “reflects the semantic, and sometimes also syntactic, relationships between the words” [98]. Another point of view about the difference between topic models and word embedding approaches could be the scale of the dimension and the sparseness degree in the vector space. Intuitively, topic models (especially the topic distribution over words) tend to adopt sparse vectors with bigger dimensions, while the word embedding approaches adopt low-dimension dense vectors which may save some memory space and provide more flexibility for the high-level applications. Note that the difference in sparseness can be decreased to some extent by the sparsing regularization as introduced by Vorontsov et al. [113].

The work reported in [55] discussed several approaches to identify “linguistic change.” As an example of linguistic change, they referred to the change of the word “gay,” which shifted from the meaning of “cheerful” or “frolicsome” to homosexuality (see Fig. 1 of that paper). They proposed three different approaches to generate time series aimed at capturing different aspects of word evolution across time: a frequency-based method, a syntactic method, and a distributional method. Because of the objective of this survey, we will focus on the last one. They divided the entire time span of the dataset into time slices of the same size, e.g., 1-month or 5-year slices. Then a word embedding technique (the gensim implementation of the Skip-gram model) was used to learn word representations in each slice; an alignment procedure was then adopted to bring all the embeddings into a unique coordinate system. Finally, the time series was obtained by calculating the distance between time 0 and time t in the embedding space of the final time slice. The use of time series has several benefits, e.g., the possibility to use change point detection methods to identify the point in time where the new word meaning became predominant. The distributional approach was the most effective in the various evaluation settings: synthetic evaluation, evaluation on a reference dataset, and evaluation with human assessors.

In [37] the change in meaning of a word through time is referred to as a “semantic change.” The authors report several examples of word meaning change, e.g., the semantic change of the word “gay” as in [55] and that of the word “broadcast,” which at the present time is mainly intended as a synonym of “transmitting a signal,” while in the early twentieth century it meant “casting out seeds.” In that work, static versions of word embedding techniques were used, but a word embedding was learned for each time slice and then aligned in order to make word vectors from different time periods comparable; aligning is addressed as an Orthogonal Procrustes Problem. Three word embedding techniques were considered. The first is based on Positive Point-wise Mutual Information (PPMI) representations, where PPMI values are computed with respect to pre-specified context words and are arranged in a matrix whose rows are the word vector representations. The second approach, referred to in the paper as SVD, considers a truncated version of the SVD of the PPMI matrix. The last method is Skip-gram with negative sampling. The work reported in that paper is pertinent to our “specialist user scenario” since the main contribution is actually a methodology to investigate two research hypotheses. In particular, the second hypothesis investigated is that “Polysemous words change at faster rates”; this is related to an old hypothesis in linguistics that dates back to [16] and states that “words become semantically extended by being used in diverse contexts.” Subsequent works [29] show that the results obtained in the literature for diverse hypotheses on semantic change (including those in [37]) should be revised; using an artificially generated corpus with “no semantic change” as a control test, they showed that the previously proposed methodologies detected a semantic change in the control test as well. The same result was observed for diverse hypotheses; see the survey reported in [56] for an overview of the diverse hypotheses investigated. As mentioned by Dubossarsky et al. [29], their result supports further research in the evaluation of dynamic approaches, “articulating more stringent standards of proof and devising replicable control conditions for future research on language change based on distributional semantics representations” [29].
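The alignment step described above can be sketched with the Orthogonal Procrustes solution available in SciPy; the two random matrices below stand in for the embeddings of a shared vocabulary in two time slices and are purely illustrative.

```python
# A minimal sketch of aligning the embedding space of one time slice onto
# another with an orthogonal transform, so word vectors become comparable
# across slices.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
emb_t0 = rng.normal(size=(1000, 100))          # shared vocabulary, slice t0
rotation = np.linalg.qr(rng.normal(size=(100, 100)))[0]
emb_t1 = emb_t0 @ rotation + 0.01 * rng.normal(size=(1000, 100))  # slice t1

# Find the orthogonal matrix R minimizing ||emb_t0 @ R - emb_t1||_F.
R, _ = orthogonal_procrustes(emb_t0, emb_t1)
aligned_t0 = emb_t0 @ R

# After alignment, per-word distances across slices measure semantic change.
drift = np.linalg.norm(aligned_t0 - emb_t1, axis=1)
print(drift.mean())                            # small, since t1 is just a rotated t0
```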

The work reported in [90] introduces a dynamic version of the exponential family embeddings previously proposed in [91]. The reason for introducing exponential family embeddings was to generalize the idea of word embedding to other data, e.g., neuronal activity or shopping for an item on the basis of the context (the other items in the shopping cart). The obtained results show that the dynamic version of exponential family embeddings provides better results in terms of conditional likelihood of held-out predictions when compared with static embeddings [71, 91] and time-binned embeddings [37].

In [4] the authors extend the Bayesian Skip-gram Model proposed in [5] to a dynamic version, considering a diffusion process of the embedding vectors over time, more specifically an Ornstein–Uhlenbeck process. Both proposed variants resulted in smoother word embedding trajectoriesFootnote 11 than the baselines, which relied on the approach proposed in [37].

In [125] the authors propose to learn temporal word embeddings by solving a joint optimization problem whose “key” component is a smoothing term that encourages embeddings to be aligned, thus explicitly solving the alignment problem while learning the embeddings and avoiding a two-step strategy like that adopted in [37] or in [55].

In [56] the authors report a number of open issues concerning the study of temporal aspects of semantic shifts. Three challenges that are particularly relevant to the works reported in this chapter and this venue are: (1) the lack of formal mathematical models of diachronic embeddings; (2) the need for robust gold standard test sets of semantic shifts; (3) the need for algorithms able to work on small datasets. With regard to the first point, investigating quantum-inspired models could be a possible research direction to find a formal mathematical framework to model dynamic/diachronic word embeddings, e.g., exploiting the generalized view of probability and the theory of the time evolution of systems. With regard to the second point, and evaluation in general, a possible direction is to devise tasks with specialists, e.g., journalists, linguists, or social scientists, to create adequate datasets. This is also related to the last point, i.e., the need for algorithms that are “robust” to the size of the dataset: indeed, specialists, even when performing longitudinal studies, may only be able to rely on relatively small datasets in order to investigate specific research issues. On the basis of the ongoing collaboration with sociologists and linguists, another open issue whose solution could be really beneficial for specialists’ investigations is “identifying groups of words that shift together in correlated ways” [56]; this could be particularly useful to investigate how some thematic issues are perceived by public opinion and how this perception varies through time. As suggested by the results reported in [29], evaluation protocols to measure the effectiveness of these algorithms should be rigorously designed.

As mentioned above, word embedding and topic models are based on two very different views. Rudolph et al. [90] suggest another possible research direction in the dynamic representation of words: devise models able to combine the two approaches and exploit their “complementary” representations in dynamic settings.

5 Conclusion

We introduced many vector space based approaches for representing words, especially word vector techniques. Regarding word vectors, we introduced many variants presented throughout their history, together with their limitations and current trends. A concise summary is reported in Table 4.

Table 4 A summary of various word vector techniques

Since the effectiveness of word embedding is supported by investigations on many NLP and IR tasks and by many benchmarks, it is worth investigating further. In the future, external knowledge such as linguistic features or human common sense (e.g., knowledge bases) is expected to be incorporated into word vectors. Besides these empirical efforts, theoretical understanding is also important to this field, e.g., interpretability of why word embedding works and where it does not.