
1 Introduction

Semantic Textual Similarity (STS) determines the degree of similarity between two pieces of text. It is applicable to a variety of Natural Language Processing (NLP) tasks, including textual entailment, paraphrase detection, machine translation, and many more. STS aims to provide a uniform framework for the generation and evaluation of the various semantic components that, conventionally, were considered independently and with only a superficial understanding of their impact on various NLP applications.

The SemEval STS task is an annual event held as part of the SemEval/*SEM family of workshops. It was one of the most anticipated STS events from 2012 to 2017 [1,2,3,4,5,6] and attracted a large number of participating teams every year. The organizers make the dataset publicly available; it contains up to 16,000 sentence pairs for training and testing, each annotated by humans with a rating between 0 and 5, with 0 indicating highly dissimilar and 5 highly similar.

Generally, the techniques under the umbrella of STS can be classified into the following two categories:

  1. Supervised Systems: The techniques in this category generate results after training on an adequate amount of data using a machine-learning or deep-learning based model [9, 10]. Deep learning has gained a lot of popularity in NLP tasks. Such models are extremely powerful and expressive, but they are also complex and non-linear, and this increased complexity makes them much slower to train on larger datasets.

  2. Unsupervised Systems: Perhaps surprisingly, the basic approaches of plain averaging [11] and weighted averaging [12] of word vectors to represent a sentence, computing the degree of similarity as the cosine distance, have outperformed LSTM-based techniques. Examples like these encourage researchers who lean towards the simpler side and exploit techniques that can process a large amount of text and scale well, instead of increasing model complexity. Some of the techniques in this category were proposed even before the STS shared task [19, 20], while others emerged during it. Some of these techniques rely on a lexical database such as the paraphrase database (PPDB) [7, 8], WordNet [21], etc. to determine contextual dependencies amongst words.

The technique proposed in this study is based on spectral learning and is fairly simple. The idea behind the approach stems from the fact that semantically equivalent sentences depend on a similar context. Hence, the goal here is to identify semantic components that can be utilized to frame context from both sentences. To achieve that, we propose a model that identifies such semantic units in a sentence based on their correlation with the words of the other sentence. The method proposed in this study, a spectral learning-based approach for measuring the strength of similarity between two sentences based on Canonical Correlation Analysis (CCA) [22], uses cosine similarity and Word Mover’s Distance (WMD) as its scoring metrics. The model is fast, scalable, and scale-invariant. Moreover, the model is linear and has the potential to perform on par with non-linear supervised learning architectures such as LSTM and BiLSTM. It also adds another layer by identifying semantic components from both sentences based on their correlation; these components can help develop a deeper level of language understanding.

2 Canonical Correlation Analysis

Given two sets of variables, canonical correlation analysis studies the linear relationships amongst them. The linear relation is captured by studying latent variables (variables that are not observed directly but inferred) that represent the observed variables. It is similar to correlation analysis but multivariate; in statistical analysis, the term appears in multivariate discriminant analysis and multiple regression analysis. It is an analog of Principal Component Analysis (PCA) for a pair of variable sets: PCA generates a direction of maximal covariance amongst the elements of a single matrix, in other words for a multivariate input on a single output, whereas CCA generates a direction of maximal covariance amongst the elements of a pair of matrices, in other words for a multivariate input on a multivariate output.

Consider two multivariate random variables x and y. Let C\(_\text {xx}\) and C\(_\text {yy}\) denote the within-set covariance matrices, C\(_\text {yx}\) the between-set covariance matrix of y and x, and C\(_\text {xy}\) its transpose. CCA tries to generate projections CV\(_{1}\) and CV\(_{2}\), a pair of linear transformations, using the optimization problem given by Eq. 1.

$$\begin{aligned} \begin{aligned}&\underset{CV_{1}, CV_{2}}{\text {max}}&\frac{CV_{1}^T C_\text {xy} CV_{2}}{ \sqrt{CV_{1}^T C_\text {xx} CV_{1}} \sqrt{CV_{2}^T C_\text {yy} CV_{2}} }\\ \end{aligned} \end{aligned}$$
(1)

Given x and y, the canonical correlations are found by solving the corresponding eigenvalue equations, where the eigenvalues are the squared canonical correlations and the eigenvectors are the normalized canonical correlation basis vectors. Besides the eigenvalues and eigenvectors, another integral piece for solving Eq. 1 is computing the inverses of the covariance matrices; CCA utilizes Singular Value Decomposition (SVD) or eigendecomposition for this purpose. Recent advances [24] have made these decompositions efficient at large scale, which is what makes CCA fast and scalable.
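To make the procedure concrete, the following is a minimal NumPy sketch, not the authors’ implementation, of solving Eq. 1 by whitening the within-set covariance matrices via eigendecomposition and taking the SVD of the whitened between-set covariance; the small ridge term `reg` is our addition for numerical stability.

```python
import numpy as np

def cca(X, Y, reg=1e-8):
    """Canonical correlations of two data matrices.

    X: (n_samples, p) and Y: (n_samples, q) hold paired observations.
    Returns the canonical correlations and the basis vectors of each view.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Within-set and between-set covariance matrices (see Eq. 1).
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Inverse square root via eigendecomposition (C is symmetric PD).
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    # Singular values of the whitened cross-covariance are the canonical
    # correlations; the singular vectors yield the basis vectors.
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return s, Wx @ U, Wy @ Vt.T
```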

More concretely, consider a group of people selected to participate in two different surveys. To determine the correlation between the two surveys, CCA finds a linear transformation of the questions from survey 1 and of the questions from survey 2 that maximizes the correlation between the projections. CCA terminology identifies the questions in the surveys as the variables and the projections as variates; hence the variates are linear transformations, or weighted averages, of the original variables. Let the questions in survey 1 be represented as \(\text {x}_{1}, \text {x}_{2}, \text {x}_{3}, \ldots , \text {x}_{\text {n}}\) and the questions in survey 2 as \(\text {y}_{1}, \text {y}_{2}, \text {y}_{3}, \ldots , \text {y}_{\text {m}}\). The first variate for survey 1 is generated using the relation given by Eq. 2.

$$\begin{aligned} \begin{aligned}&CV_{1} = a_{1}x_{1} + a_{2}x_{2} + a_{3}x_{3} + \dots + a_{\text {n}}x_{\text {n}} \end{aligned} \end{aligned}$$
(2)

Similarly, the first variate for survey 2 is generated using the relation given by Eq. 3.

$$\begin{aligned} \begin{aligned}&CV_{2} = b_{1}y_{1} + b_{2}y_{2} + b_{3}y_{3} + \dots + b_{\text {m}}y_{\text {m}} \end{aligned} \end{aligned}$$
(3)

where \(\text {a}_{1}, \text {a}_{2}, \text {a}_{3}, \ldots , \text {a}_{\text {n}}\) and \(\text {b}_{1}, \text {b}_{2}, \text {b}_{3}, \ldots , \text {b}_{\text {m}}\) are weights chosen to maximize the correlation between CV\(_{1}\) and CV\(_{2}\). CCA can generate a second pair of variates from the residuals of the first pair, and further pairs in the same manner, such that the variates are independent of each other, i.e. the projections are orthogonal.
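As a usage sketch of this survey scenario, the snippet below, with synthetic data of our own, fits scikit-learn’s CCA and reports the correlation of each variate pair:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
survey1 = rng.normal(size=(100, 6))                          # 100 respondents, 6 questions
survey2 = survey1[:, :4] + 0.5 * rng.normal(size=(100, 4))   # 4 correlated questions

cca = CCA(n_components=2)
cv1, cv2 = cca.fit_transform(survey1, survey2)  # variate pairs (Eqs. 2 and 3)
for k in range(2):
    r = np.corrcoef(cv1[:, k], cv2[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.3f}")
```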

When applying CCA, the following fundamentals need to be taken care of:

  1. Determining the minimum number of variate pairs to be generated.

  2. Analyzing the significance of a variate from two perspectives: the strength of its relationship with the original variables from which it was transformed, and the strength of the relationship within the corresponding variate pair.

2.1 CCA for Computing Semantic Units

Given two views X = (X\(^{(1)}\), X\(^{(2)}\)) of the input data and a target variable Y of interest, Foster [23] exploits CCA to generate a projection of X that reduces the dimensionality without compromising its predictive power. The authors assume, as represented by Eq. 4, that the views are independent of each other conditioned on a hidden state h, i.e.

$$\begin{aligned} \begin{aligned}&P(X^{(1)}, X^{(2)}|h) =P (X^{(1)}|h) P (X^{(2)}|h) \\ \end{aligned} \end{aligned}$$
(4)

Here CCA utilizes the multi-view nature of data to perform dimensionality reduction.

STS estimates the potential of a candidate sentence to be considered a semantic counterpart of another sentence. Measuring text similarity has a long history and has contributed widely to applications designed for text processing and related areas: it has been used for machine translation, text summarization, semantic search, word sense disambiguation, and many more. While making such an assessment is trivial for humans, building algorithms and computational models that mimic human-level performance poses a challenge. Natural language processing applications such as generative models typically assume a Hidden Markov Model (HMM) as the learning function, and an HMM likewise exhibits a multi-view nature. Hence, two sentences that share semantic unit(s) c provide two natural views, and CCA can be capitalized on, as shown in Eq. 5, to extract this relationship.

$$\begin{aligned} \begin{aligned} P(S_{1}, S_{2}|c) = P (S_{1}|c) P (S_{2}|c) \end{aligned} \end{aligned}$$
(5)

where S\(_{1}\) and S\(_{2}\) denote the two sentences that are assumed to share some semantic unit(s) c. As discussed in the previous section, CCA is fast and scalable. Moreover, CCA neither requires the views to be of a fixed length nor requires them to be of the same length; hence it is scale-invariant with respect to the observations.

3 Model

3.1 Data Collection

We test our model on three textual similarity tasks, all of which were published in the SemEval semantic textual similarity (STS) tasks (2012–2017). The first dataset considered for experimentation was from SemEval-2017 Task 1 [6], an ongoing series of evaluations of computational semantic analysis systems, with a total of 250 sentence pairs. The second dataset was the SemEval 2012 textual similarity dataset named “OnWN” [4], in which each sentence pair is generated from OntoNotes and its corresponding WordNet definition. The last was the SemEval 2014 textual similarity dataset named “headlines” [2], which contains sentences taken from news headlines. The latter two datasets have 750 sentence pairs each. In all three datasets a sentence pair is accompanied by a rating between 0 and 5, with 0 indicating highly dissimilar and 5 highly similar. An example of a sentence pair available in the SemEval STS task is shown in Table 1.

Table 1. A sample demonstration of a sentence pair available in the publicly released SemEval semantic textual similarity (STS) task dataset.

3.2 Data Preprocessing

It is important to pre-process the input data to improve learning and elevate the performance of the model. Before running the similarity algorithm, the collected data is pre-processed with the following steps (a minimal sketch of the pipeline follows the list).

  1. Tokenization - Processing one sentence at a time from the dataset, the sentence is broken into a list of words, which is essential for creating word embeddings.

  2. Removing punctuation - Punctuation, exclamation, and other marks are removed from the sentence using regular expressions and replaced with empty strings, as there is no vector representation available for such marks.

  3. Replacing numbers - Numerical values are converted to their corresponding words, which can then be represented as embeddings.

  4. Removing stop words - In this step the stop words are removed from each sentence. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that does not add valuable semantic information to the sentence. The list of stop words is obtained from the nltk package in Python.
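The following is a minimal sketch of these steps, assuming nltk for tokenization and stop words as stated above; the num2words package is our assumption for the number-to-word conversion, since the paper does not name a tool.

```python
import re
import nltk
from nltk.corpus import stopwords
from num2words import num2words  # assumed tool for number-to-word conversion

nltk.download("punkt", quiet=True)       # tokenizer models
nltk.download("stopwords", quiet=True)   # stop-word list
STOP_WORDS = set(stopwords.words("english"))

def preprocess(sentence):
    # 1. Tokenization: break the sentence into a list of words.
    tokens = nltk.word_tokenize(sentence.lower())
    # 2. Removing punctuation: strip marks, then drop empty tokens.
    tokens = [re.sub(r"[^\w\s]", "", t) for t in tokens]
    tokens = [t for t in tokens if t]
    # 3. Replacing numbers: convert numerals to their word form.
    tokens = [num2words(int(t)) if t.isdigit() else t for t in tokens]
    # 4. Removing stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("A man is playing 2 guitars."))  # ['man', 'playing', 'two', 'guitars']
```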

3.3 Identifying Semantic Units

Our contribution to the STS task adds another layer by identifying semantic units in a sentence. These units are identified based on their correlation with the semantic units identified in the paired sentence. Each sentence \(\text {s}_{\text {i}}\) is represented as a list of word2vec embeddings, \(\text {s}_\mathrm{i} = (\text {w}_\mathrm{i1}, \text {w}_\mathrm{i2}, ..., \text {w}_\mathrm{ik}\)), where k is the number of words in the sentence and each element is the embedding counterpart of its corresponding word in the m-dimensional space of Google’s word2vec. Given two sentences \(\text {s}_{\text {i}}\) and \(\text {s}_{\text {j}}\), CCA projects variates as linear transformations of \(\text {s}_{\text {i}}\) and \(\text {s}_{\text {j}}\). The number of projections that can be generated is limited by the length, i.e. the number of words, of the shorter of the two sentences; e.g., if the lengths of \(\text {s}_{\text {i}}\) and \(\text {s}_{\text {j}}\) are 8 and 5 respectively, the maximum number of correlation variates output is 5. Conventionally, word vectors were considered independently and with a superficial understanding of their impact in various NLP applications, but the components obtained here can contribute vividly to an NLP task. A sample of the semantic units identified on a sentence pair is shown in Table 2, and a sketch of this step follows the table.

Table 2. A sample of semantic units identified on a sentence pair in the SemEval dataset.
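The paper leaves the exact arrangement of the two word-embedding matrices for CCA implicit; one plausible reading, sketched below under that assumption, treats the m embedding dimensions as the paired observations and the words as the variables, which caps the number of variates at the length of the shorter sentence as stated above. The semantic_units helper and the vector file path are illustrative, and gensim is our choice for loading Google’s word2vec vectors.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cross_decomposition import CCA

# 300-dimensional GoogleNews word2vec vectors; the path is illustrative.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def semantic_units(tokens_i, tokens_j, top=2):
    vi = [t for t in tokens_i if t in w2v]
    vj = [t for t in tokens_j if t in w2v]
    # Columns are words, rows are the 300 embedding dimensions, so each
    # sentence forms one view over 300 paired observations.
    Xi = np.column_stack([w2v[t] for t in vi])
    Xj = np.column_stack([w2v[t] for t in vj])
    k = min(len(vi), len(vj))   # variates capped by the shorter sentence
    cca = CCA(n_components=k).fit(Xi, Xj)
    # Words carrying the largest absolute weight in the first variate
    # pair are read off as the correlated semantic units.
    ui = np.argsort(-np.abs(cca.x_weights_[:, 0]))[:top]
    uj = np.argsort(-np.abs(cca.y_weights_[:, 0]))[:top]
    return [vi[i] for i in ui], [vj[i] for i in uj]
```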

3.4 Formulating Similarity

The correlation variates projected by CCA are used to generate a new representation of each sentence \(\text {s}_{\text {i}}\) as a list of word2vec vectors, \(\text {s}_{\text {i}} = (\text {w}_\mathrm{i1}, \text {w}_\mathrm{i2}, ..., \text {w}_\mathrm{in}\)), where each element is the Google word2vec embedding corresponding to a variate identified by CCA.

Given a range of variate pairs, there are two ways of generating a similarity score for sentences \(\text {s}_{\text {i}}\) and \(\text {s}_{\text {j}}\) (a sketch of both options follows the list):

  1. Cosine similarity: This is a very common and popular similarity measure. Given a pair of sentences represented as \(\text {s}_{\text {i}} = (\text {w}_\mathrm{i1}, \text {w}_\mathrm{i2}, ..., \text {w}_\mathrm{im}\)) and \(\text {s}_{\text {j}} = (\text {w}_\mathrm{j1}, \text {w}_\mathrm{j2}, ..., \text {w}_\mathrm{jm}\)), the cosine similarity measure is defined by Eq. 6.

    $$\begin{aligned} \begin{aligned} sim(s_\mathrm{i},s_\mathrm{j}) = \frac{\sum _{k=1}^{m} w_\mathrm{ik} w_\mathrm{jk}}{\sqrt{\sum _{k=1}^{m} w_\mathrm{ik}^2} \sqrt{\sum _{k=1}^{m} w_\mathrm{jk}^2}} \end{aligned} \end{aligned}$$
    (6)

    The similarity score is calculated by computing the mean of the cosine similarities over all variate pairs.

  2. Word Mover’s Distance (WMD): WMD assesses the “distance” between two documents in a meaningful way. It harnesses word embeddings generated by techniques like GloVe [13] or word2vec, which are semantically strong: words that are semantically related are expected to have similar vectors. Let \(\text {T} = (\text {t}_{1}, \text {t}_{2}, ..., \text {t}_{\text {m}}\)) represent a set of m different words from a document A, and \(\text {P} = (\text {p}_{1}, \text {p}_{2}, ..., \text {p}_{\text {n}}\)) a set of n different words from a document B. The minimum cumulative distance that the word cloud of document A must travel to reach the word cloud of document B becomes the distance between them.
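Below is a minimal sketch of the two scoring options, assuming the CCA fit and the w2v vectors from the previous sketch; gensim’s wmdistance (which needs an optimal-transport backend such as POT installed) returns a distance, so smaller values mean more similar sentences.

```python
import numpy as np

def cosine_score(cca, Xi, Xj):
    # Project both views onto their canonical variates and average the
    # per-pair cosine similarity (Eq. 6).
    Vi, Vj = cca.transform(Xi, Xj)
    sims = [np.dot(Vi[:, k], Vj[:, k])
            / (np.linalg.norm(Vi[:, k]) * np.linalg.norm(Vj[:, k]))
            for k in range(Vi.shape[1])]
    return float(np.mean(sims))

def wmd_score(tokens_i, tokens_j):
    # Word Mover's Distance over the same word2vec vectors.
    return w2v.wmdistance(tokens_i, tokens_j)
```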

A min-max normalization, given in Eq. 7, is applied to the similarity scores generated by cosine similarity or WMD, and the normalized scores are then rescaled to the 0–5 rating range of the gold labels.

$$\begin{aligned} \begin{aligned} x_{\text {scaled}} = \frac{x - x_{\text {min}}}{x_{\text {max}} - x_{\text {min}}} \end{aligned} \end{aligned}$$
(7)
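As we read it, Eq. 7 maps the scores onto [0, 1], so a final multiplication by 5 yields the 0–5 scale; a short sketch under that assumption:

```python
import numpy as np

def rescale(scores, high=5.0):
    scores = np.asarray(scores, dtype=float)
    # Min-max normalization (Eq. 7), stretched onto the 0-5 rating scale.
    return high * (scores - scores.min()) / (scores.max() - scores.min())
```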

4 Results and Analysis

The key evaluation criterion is Pearson’s correlation coefficient between the predicted scores and the ground-truth scores. The results on the “OnWN” and “headlines” datasets, published in the SemEval semantic textual similarity (STS) tasks of 2012 and 2014 respectively, are shown in Table 3. The first three rows are from the official task rankings, followed by seven models proposed by Wieting [11]. The last two columns indicate the results of the proposed model with cosine similarity and WMD respectively. The dataset published in the SemEval STS task 2017 is identified as the Semantic Textual Similarity Benchmark (STS-B) by the General Language Understanding Evaluation (GLUE) benchmark [16]. The results of the official task rankings for STS-B are shown in Table 4, and Table 5 indicates the results of the proposed model with cosine similarity and WMD respectively. Since the advent of GLUE, many models have been proposed for the STS-B task, such as XLNet [17], ERNIE 2.0 [18], and many more that produce results above 90 on STS-B; details of these models are available on the official website of GLUEFootnote 1. However, the increased model complexity makes such models much slower to train on larger datasets. The work here focuses on finding semantic similarity by identifying semantic components using an approach that is linear, scale-invariant, scalable, and fairly simple.

Table 3. Results on the SemEval-2012 and 2014 textual similarity datasets (Pearson’s r × 100).
Table 4. Results on the STS-B task from the GLUE benchmark (Pearson’s r × 100).
Table 5. Results of the proposed spectral learning-based model on the SemEval 2017 dataset (Pearson’s r × 100).

5 Conclusion

We proposed a spectral learning-based model, namely CCA with cosine similarity and WMD, and compared it against various other competitive models on three different datasets. The proposed model utilizes a scalable algorithm, so it can be included in any research inclined towards textual analysis. With the added bonus of being simple, fast, and scale-invariant, it is an easy fit for a study.

Another important takeaway from this study is the identification of semantic units. The first step in any NLP task is providing a uniform structure for the generation and evaluation of the various semantic units that, conventionally, were considered independently and with a superficial understanding of their impact. Such components can help in understanding the development of context over the sentences in a document, user reviews, question-answer pairs, and dialog sessions.

Even though our model could not produce the best results, it still performed better than some models and gave competitive results against others, which shows that there is considerable scope for improvement. One limitation of the model is its inability to identify semantic units larger than a word, for instance a phrase. It would also be interesting to develop a model that combines this spectral model with a supervised or unsupervised one. With further improvement, the model will be helpful in various ways and can be used in applications such as document summarization, word sense disambiguation, short answer grading, and information retrieval and extraction.