
1 Introduction

The relationships between words, their representation in text, concepts in the mind, and objects in the real world have been a source of inquiry for many centuries. Empirical, distributional paradigms have been shown to successfully derive human-like estimates of semantic distance from large text corpora, and recent developments in this area have enabled the enrichment of distributional models with structural information, such as the relative position of terms [1, 2], and orthographic information describing the configuration of characters from which words are composed [3, 4]. This paper extends these beginnings in three principal ways. First, we propose a new and very simple method for encoding structure into semantic vectors, using a quantization of the space between two extreme ‘demarcator vectors’. This vector generation method performs well in experiments and has some key computational advantages. Second, the method is general enough to be applied within a wide range of Vector Symbolic Architectures (VSAs), including Circular Holographic Reduced Representations (CHRRs), which use complex vectors [5]. Third, we investigate some higher-level compositions as well, demonstrating early results with compositional representations for sentences.

Each of these developments is related to quantum interaction as follows. Demarcator vector generation is similar in a sense to the derivation of orbital angular momentum values in quantum mechanics, in that the two use the same mathematics. The applicability of the method within many VSAs, including those over complex vector space, makes it available within algebras that are particularly related to generalized quantum models. Finally, compositional methods are important in research on generalized quantum structures, and the way many levels of representation are combined seamlessly in the compositional models presented here should be of interest to researchers in the field.

2 Orthogonality in Distributional and Orthographic Models

Geometric methods of distributional semantics derive vector representations of terms from electronic text, such that terms that occur in similar contexts will have similar vectors [6, 7]. While such models have been shown to approximate human performance on a number of cognitive tasks [8, 9], they generally do not take into account structural elements of language, and consequently have been referred to, at times critically, as “bags of words” models. Emerging approaches to semantic space models have leveraged reversible vector transformations to encode additional layers of meaning into vector representations of terms and concepts. Examples include the encoding of the relative position of terms [1, 10], syntactic information [11], and orthographic information [4].

The general approach depends upon generating vector representations for terms, and associating reversible vector transformations with properties of interest. In accordance with terminology developed in [12], we will refer to the vector representations of atomic components such as terms as elemental vectors. Elemental vectors are constructed using a randomization procedure such that they have a high probability of being mutually orthogonal, or close-to-orthogonal. This adds robustness to the model, by making it highly improbable that elemental vectors will be confused with one another, despite the distortion that occurs during training. However, it also introduces the implicit assumption that elemental vectors are unrelated to one another, which means that models generated in this way must be composed of discrete elements.
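The following Python sketch (illustrative only, not the implementation evaluated in this paper) shows why randomly generated elemental vectors are near-orthogonal with high probability; the dimensionality and sparsity values match those of the real-vector scheme described in Sect. 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_elemental(dim=1000, seeds=10):
    # Sparse ternary elemental vector: a few randomly placed +1/-1 entries.
    v = np.zeros(dim)
    idx = rng.choice(dim, size=seeds, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=seeds)
    return v

a, b = random_elemental(), random_elemental()
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine(a, b) = {cosine:.3f}")  # typically 0 or very close to it
```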

This limitation notwithstanding, this approach has allowed for the integration of structured information into distributional models of meaning. From the perspective of cognitive psychology, this is desirable as it presents the possibility of a unified term representation that can account for a broad range of experimental phenomena. Recent work in this area has leveraged circular convolution to generate vectors representing the orthographic form of words [3], and to integrate these with a geometric model of distributional semantics [4]. Vector representations of orthographic word form are generated by using circular convolution to create bound products representing the component bigrams of the term concerned, including non-contiguous bigrams. Kachergis et al. give the following example (\(\circledast \) indicates binding using circular convolution) [4]:

$$\begin{aligned} word =&\ w + o + r + d \\ +&\ w \circledast o + o \circledast r + r \circledast d + (w \circledast \_) \circledast r + (w \circledast \_) \circledast d + (o \circledast \_) \circledast d \\ +&\ (w \circledast o) \circledast r + ((w \circledast \_) \circledast r) \circledast d + ((w \circledast o) \circledast \_) \circledast d + (o \circledast r) \circledast d \end{aligned}$$

The vector representation for the term “word” is generated by combining a set of vectors representing unigrams, bigrams, and trigrams of characters. It is a characteristic of the model employed that each of these vectors has a high probability of being mutually orthogonal, or close-to-orthogonal. So, for example, the vectors representing the trigrams \(((w \circledast \_ ) \circledast r)\) and \(((w \circledast o ) \circledast r)\) will be dissimilar from one another. Consequently it is necessary to explicitly encode all of the n-grams of interest, including gapped trigrams (such as “w_r”), to provide flexibility. From the perspective of computational complexity, this is not ideal, as the number of representational units that must be generated and encoded is at least quadratic in the length of the sequence.
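As a sketch of how such bound products might be computed (assuming Plate-style circular convolution over real-valued random vectors, rather than the exact implementation of [3, 4]), the following fragment shows that the two trigrams mentioned above are indeed encoded as unrelated vectors:

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 1024

def elemental(dim=1024):
    # Dense random vector; such vectors are near-orthogonal in high dimensions.
    return rng.standard_normal(dim) / np.sqrt(dim)

def cconv(x, y):
    # Circular convolution, computed via the FFT.
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

w, o, r = elemental(), elemental(), elemental()
gap = elemental()  # placeholder vector for a skipped character

wor = cconv(cconv(w, o), r)    # trigram "wor"
w_r = cconv(cconv(w, gap), r)  # gapped trigram "w_r"

cos = wor @ w_r / (np.linalg.norm(wor) * np.linalg.norm(w_r))
print(f"cosine(wor, w_r) = {cos:.3f}")  # near zero: the two trigrams share no similarity
```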

Rather than encoding character position precisely (with respect to some other character, or the term itself), alternative models of orthographic representation allow for a degree of uncertainty with respect to letter position. These approaches measure the relatedness between terms on the basis of the similarity between probability distributions assigned to the positions of each matching character [13, 14], providing a more flexible measure of similarity. However, on account of the constraints we have discussed, only orthographic representations based on discrete bigrams or the exact position of characters have to date been combined with other sorts of distributional information in an attempt to generate a holistic representation [3, 15]. Imposing near-orthogonality adds robustness, but also necessitates discarding potentially useful structural information, namely the proximity between character positions within a word. Consequently, we have selected the generation of orthographic representations as an example application through which to illustrate the utility of our approach.

The paper proceeds as follows. First, we present the mathematical language we will use to describe the operators provided by VSAs, a family of representational approaches based on reversible vector transformations [16]. Next, we will describe an approach we have developed through which the distance between elemental vectors, and hence bound products, can be predetermined. In the context of an illustrative application for orthographic modeling, we show that this approach permits the encoding of structural information to do with proximity, rather than absolute position, into a distributional model. We then discuss a relationship between this approach and quantum mechanics, and conclude with some experimental results and example applications.

3 Mathematical Structure and Methods

3.1 Vector Symbolic Architectures (VSAs)

The reversible vector transformations we have discussed are a distinguishing feature of a family of representational approaches collectively known as VSAs [16]. In our experiments the VSAs we will use are Kanerva’s Binary Spatter Code (BSC), which uses binary vectors [17], and Plate’s CHRR [5], which uses complex vectors where each dimension represents an angle between \(-\pi \) and \(\pi \), using the implementation developed in [10]. In addition, we will use an approach based on permutation of real vectors [2].

Binding is the primary operation facilitated by VSAs (in addition to standard operators for vector superposition and vector comparison). Binding is a multiplication-like operator through which two vectors are combined to form a third vector C that is dissimilar from either of its component vectors A and B. We will use the symbol “\(\otimes \)” for binding, and the symbol “\(\oslash \)” for the inverse of binding for the remainder of this paper. It is important that this operator be invertible: if \(\mathrm{C} = \mathrm{A} \otimes \mathrm{B}\), then \(\mathrm{A} \oslash \mathrm{C} = \mathrm{A} \oslash (\mathrm{A} \otimes \mathrm{B}) = \mathrm{B}\). In some models, this recovery may be approximate, but the robust nature of the representation guarantees that \(\mathrm{A} \oslash \mathrm{C}\) is similar enough to B that B can easily be recognized as the best candidate for \(\mathrm{A} \oslash \mathrm{C}\) in the original set of concepts. Thus the invertible nature of the bind operator facilitates the retrieval of the information it encodes.

In the case of the BSC, elemental vectors are initialized by randomly assigning 0 or 1 to each dimension with equal probability. Pairwise exclusive or (\({\textsc {XOR}}\)) is used as a binding operator: \(\mathrm {X} \otimes \mathrm {Y} = \mathrm {X}{\textsc {\ XOR\ }}\mathrm {Y}\). As it is its own inverse, the binding and decoding processes are identical (\(\otimes = \oslash \)). For superposition, the BSC employs a majority vote: if the component vectors have more ones than zeros in a dimension, this dimension will have a value of one, with ties broken at random.
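A minimal sketch of the BSC operators just described (the dimensionality here is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(7)
dim = 10_000

def random_bsc(dim=10_000):
    # Elemental vector: 0 or 1 in each dimension with equal probability.
    return rng.integers(0, 2, size=dim, dtype=np.uint8)

def bind(x, y):
    # XOR binding; the operation is its own inverse.
    return np.bitwise_xor(x, y)

def superpose(vectors):
    # Majority vote per dimension, with ties broken at random.
    sums = np.sum(np.asarray(vectors, dtype=np.int32), axis=0)
    out = (sums * 2 > len(vectors)).astype(np.uint8)
    ties = (sums * 2 == len(vectors))
    out[ties] = rng.integers(0, 2, size=int(ties.sum()))
    return out

A, B = random_bsc(), random_bsc()
C = bind(A, B)
assert np.array_equal(bind(A, C), B)  # exact recovery: A decoded from C yields B
```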

In CHRR, binding through circular convolution is accomplished by pairwise multiplication: \(X \otimes Y = \{ X_1Y_1, X_2Y_2, \ldots , X_{n-1} Y_{n-1}, X_nY_n \} \), which is equivalent to addition of the phase angles of the circular vectors concerned. Binding is inverted by binding to the inverse of the vector concerned: \(X \oslash Y = X \otimes Y^{\mathrm {-1}}\), where the inverse of a vector is its complex conjugate. Elemental vectors are initialized by randomly assigning a phase angle to each dimension. Superposition is accomplished by pairwise addition of the unit circle vectors, and normalization of the result for each circular component. In the implementation used in our experiments, normalization occurs after training concludes, so the sequence in which superposition occurs is not relevant.
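A corresponding sketch for CHRR (unit phasors per dimension; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(11)
dim = 500

def random_chrr(dim=500):
    # Each dimension is a unit phasor with a random angle in (-pi, pi].
    return np.exp(1j * rng.uniform(-np.pi, np.pi, size=dim))

def bind(x, y):
    # Elementwise multiplication adds the phase angles.
    return x * y

def release(x, y):
    # Binding with the complex conjugate inverts the operation.
    return x * np.conj(y)

A, B = random_chrr(), random_chrr()
C = bind(A, B)
recovered = release(C, A)
sim = np.real(np.vdot(recovered, B)) / dim
print(f"similarity(recovered, B) = {sim:.3f}")  # ~1.0: exact recovery for unit phasors
```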

Our real vector implementation follows the approach developed by Sahlgren and his colleagues [2], and differs from the binary and complex implementations, in that elemental vectors are “bound” to permutations, rather than to other vectors. Elemental vectors are constructed by creating a high-dimensional zero vector (on the order of 1000 dimensions), and setting a small number of the dimensions of this vector (on the order of 10) to either +1 or \(-1\) at random. The permutations utilized consist of shifting all of the elements of a given vector \(n\) positions to the right, where each value \(n\) is assigned to, or derived from, the information it is intended to encode. In the case of our orthographic model, this information consists of the character occurring in a particular position, so we have used the ASCII value of the character concerned as \(n\). Binding is reversed by permuting all of the elements of the vector \(n\) positions to the left. Superposition is accomplished by adding the vectors concerned, and normalizing the result.
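A sketch of this permutation-based binding (circular shifts standing in for the permutations, as in the description above; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)

def elemental(dim=1000, seeds=10):
    # Sparse ternary vector: a few randomly placed +1/-1 entries.
    v = np.zeros(dim)
    idx = rng.choice(dim, size=seeds, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=seeds)
    return v

def bind(vec, n):
    # "Bind" by shifting all elements n positions to the right (circularly).
    return np.roll(vec, n)

def release(vec, n):
    # Invert by shifting n positions to the left.
    return np.roll(vec, -n)

E_w = elemental()
bound = bind(E_w, ord('w'))  # e.g. use the ASCII code of 'w' as the shift n
assert np.array_equal(release(bound, ord('w')), E_w)
```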

In all models, the “random” initiation of elemental vectors is rendered deterministic by seeding the random number generator with a hash value derived from a string or character of interest following the approach developed in [18]. This retains the property of near-orthogonality where desired, while ensuring that random overlap between elemental vectors is consistent across experiments.
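The hash and seed-derivation scheme below are illustrative assumptions (the approach of [18] is not reproduced here); the point is simply that the same token always yields the same "random" elemental vector:

```python
import hashlib
import numpy as np

def seeded_rng(token: str) -> np.random.Generator:
    # Derive a reproducible seed from the token, so the elemental vector for,
    # say, 'w' is identical across runs and experiments.
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return np.random.default_rng(int.from_bytes(digest[:8], "little"))

def elemental(token: str, dim: int = 1000, seeds: int = 10) -> np.ndarray:
    rng = seeded_rng(token)
    v = np.zeros(dim)
    idx = rng.choice(dim, size=seeds, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=seeds)
    return v

assert np.array_equal(elemental("w"), elemental("w"))  # deterministic across calls
```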

Fig. 1. Interpolation to generate five demarcator vectors. P(1) = probability of 1.

3.2 Measured Similarity

The first step in our approach involves generating a set of vectors that are a fixed distance apart, which we will refer to as demarcator vectors \((D({\mathrm {position}}))\), as illustrated in Fig. 1. The first pair of demarcator vectors are conventional elemental vectors \(D(\alpha )\) and \(D(\omega )\), constructed randomly such that they have a high probability of being mutually orthogonal or close-to-orthogonal. To ensure orthogonality with certainty, we render \(D(\omega )\) orthogonal to \(D(\alpha )\) using the quantum negation procedure, or its binary approximation, described in [19] and [20] respectively. The remaining demarcator vectors are generated by interpolation. In the continuous vector spaces, this is accomplished by subdividing the \(90^{\circ }\) angle between \(D(\alpha )\) and \(D(\omega )\) and generating the corresponding unit vectors. In binary vector space, this is accomplished by weighting the probability of assigning a 1, where \(D(\alpha )\) and \(D(\omega )\) disagree, in accordance with the desired distance between the new demarcator vector and each of these extremes (Footnote 1). As the vectors representing adjacent numbers are approximately equidistant, the distances between vector pairs representing numbers the same distance apart should also be approximately equal (e.g. \(sim(D(1),D(2)) \approx sim(D(2),D(3))\)).
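The following sketch illustrates demarcator generation under these descriptions. In the real case, exact orthogonalization is obtained by projection (standing in for the quantum negation procedure of [19]); in the binary case, the shared per-bit threshold is one way to realize the weighted assignment described above, and is an assumption rather than a description of the released implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

def real_demarcators(dim=1000, n=5):
    # D(alpha) and D(omega): random unit vectors, made exactly orthogonal by
    # projecting the D(alpha) component out of D(omega).
    d_alpha = rng.standard_normal(dim)
    d_alpha /= np.linalg.norm(d_alpha)
    d_omega = rng.standard_normal(dim)
    d_omega -= (d_omega @ d_alpha) * d_alpha
    d_omega /= np.linalg.norm(d_omega)
    # Remaining demarcators: unit vectors subdividing the 90-degree angle.
    return [np.cos(t) * d_alpha + np.sin(t) * d_omega
            for t in np.linspace(0.0, np.pi / 2, n)]

def binary_demarcators(dim=32000, n=5):
    # D(omega) disagrees with D(alpha) in roughly half of its bits, so the two
    # extremes are approximately orthogonal under the binary similarity metric.
    d_alpha = rng.integers(0, 2, size=dim, dtype=np.uint8)
    disagree = rng.random(dim) < 0.5
    d_omega = np.where(disagree, 1 - d_alpha, d_alpha).astype(np.uint8)
    # A shared per-bit threshold means demarcator i takes the D(omega) bit in a
    # fraction i/(n-1) of the disagreeing positions, with nested sets of flipped
    # bits, so similarity falls off linearly with distance between demarcators.
    u = rng.random(dim)
    out = []
    for i in range(n):
        p = i / (n - 1)
        take_omega = disagree & (u < p)
        out.append(np.where(take_omega, d_omega, d_alpha).astype(np.uint8))
    return out

D = real_demarcators()
print(round(D[0] @ D[1], 2), round(D[1] @ D[2], 2))  # both ~0.92, cf. Table 1

B = binary_demarcators()
sim01 = 2 * (0.5 - np.mean(B[0] != B[1]))
sim12 = 2 * (0.5 - np.mean(B[1] != B[2]))
print(round(sim01, 2), round(sim12, 2))              # both ~0.75, cf. Table 1
```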

Table 1 illustrates the pairwise similarities between a set of five demarcator vectors constructed in this manner. Vectors of dimensionality 32,000, 500, and 1,000 are used in the binary, complex, and real cases respectively. These dimensions were chosen so as to normalize the space requirements of the stored vectors across models, and were retained in our subsequent experiments. In all cases, the relatedness between demarcator vectors a fixed distance apart is approximately equal. For example, the similarity between all pairs of demarcator vectors two positions apart (e.g. 1 and 3) is approximately 0.5 in the binary implementation, and 0.71 in the complex and real vector implementations. In the binary implementation, the difference in relatedness is proportional to the difference in demarcator position. This is not the case in the complex or real implementations, where the drop in similarity between demarcator vectors becomes progressively steeper. This is an artifact of the metric used to measure similarity in each case: with binary vectors, \(2 \times (0.5 - {\mathrm { normalized\ Hamming\ distance}})\) is used, while with continuous vectors the cosine metric is used. While a proportional decrement could be obtained by measuring the angle between continuous vectors directly (i.e. taking the arccos of the cosine value), we will retain the use of the cosine metric for our experiments.

Table 1. Pairwise similarity between demarcator vectors

3.3 Encoding Orthography

Using controlled degrees of non-orthogonality, we can encode information about the positions of letters in words into their vector representations. Like spatial encoding [14] and the overlap model [13], our approach is based upon measuring the difference in position between matching characters. This is accomplished by creating elemental vectors for characters, and binding them to demarcator vectors representing positions. For example, the orthographic vector for the term “word” is constructed as follows:

$$\begin{aligned} S({\mathrm {word}}) = E(\mathrm {w}) \otimes D(1) + E(\mathrm {o}) \otimes D(2) + E(\mathrm {r}) \otimes D(3) + E(\mathrm {d}) \otimes D(4) \end{aligned}$$

As the elemental vectors for characters are mutually orthogonal or near-orthogonal, bound products derived from different characters will be orthogonal or near-orthogonal also. For example, in the complex vector space used to generate Table 1, \(sim(E(\mathrm {w}) \otimes D(1) , E(\mathrm {q}) \otimes D(1)) = 0\). Furthermore, the distance between bound products containing the same character will approximate the distance between their demarcator vectors. For example, in the real vector space used to generate Table 1, \(sim(E(\mathrm {w}) \otimes D(1) , E(\mathrm {w}) \otimes D(2)) = 0.92 = sim(D(1), D(2))\). Ultimately, the similarity between a pair of terms is derived from the distance between their matching characters. If this distance is generally low, the orthographic similarity between these terms will be high (Footnote 2). Thus, the models so generated are innately tolerant to variations such as transposition, insertion and deletion of sequence elements.
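The following self-contained sketch mirrors the real-vector variant: demarcator vectors for positions are interpolated between two orthogonal extremes, and each position's demarcator is bound to its character by a circular shift derived from the ASCII code, as in Sect. 3.1. It is illustrative rather than the released implementation.

```python
import numpy as np

rng = np.random.default_rng(13)
DIM, N_POS = 1000, 8

# Demarcator vectors for positions 1..N_POS (real case: angle subdivision).
d_alpha = rng.standard_normal(DIM)
d_alpha /= np.linalg.norm(d_alpha)
d_omega = rng.standard_normal(DIM)
d_omega -= (d_omega @ d_alpha) * d_alpha
d_omega /= np.linalg.norm(d_omega)
D = [np.cos(t) * d_alpha + np.sin(t) * d_omega
     for t in np.linspace(0.0, np.pi / 2, N_POS)]

def orthographic_vector(word: str) -> np.ndarray:
    # Bind each position's demarcator to its character by a character-specific
    # circular shift (the ASCII code), then superpose and normalize.
    s = np.zeros(DIM)
    for pos, ch in enumerate(word):
        s += np.roll(D[pos], ord(ch))
    return s / np.linalg.norm(s)

def sim(a: str, b: str) -> float:
    return float(orthographic_vector(a) @ orthographic_vector(b))

print(round(sim("word", "word"), 2))  # 1.0
print(round(sim("word", "wrod"), 2))  # high: transposed characters stay close in position
print(round(sim("word", "ward"), 2))  # lower: one character does not match at all
print(round(sim("word", "lamp"), 2))  # near zero: no matching characters
```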

Fig. 2. Orbital angular momentum vectors, as derived from quantized states along a given axis (left), and a related strategy for encoding positions as vectors

4 Demarcator Vectors and Orbital Angular Momentum

Evenly distributing normalized vectors between two orthogonal vectors is one of many strategies we could adopt to generate demarcator vectors. To generalize this process, we can describe it as follows:

  1. Construct a line in the vector space with a given starting point and direction.

  2. Place demarcators along this line using some dividing strategy.

  3. (Optional) Project demarcator vectors onto the unit circle to normalize them.

Two examples of this strategy are illustrated in Fig. 2. In the example on the left, the generating line is the vertical axis, the dividing strategy is to mark points at even intervals along this axis, and the projection strategy is to project orthogonally from the vertical axis onto the unit sphere (Footnote 3). In the example on the right, we have chosen a generating line parallel to the vertical axis, marked points at even intervals, and projected onto the unit sphere using standard vector normalization. The left-hand strategy will be familiar to some readers: this is precisely the way orbital angular momentum states are generated in quantum mechanics. We refer the reader to a text on quantum mechanics for this derivation, e.g., [21, Ch. 14]: the process includes solving a wave equation in three dimensions, exploring the angular momentum operator and the commutator relations between its components, and noting that each point on the axis can be mapped to many points on the outer sphere (this ambiguity corresponding to the fact that measuring the component of angular momentum along one axis must leave the components along the other axes undetermined, according to the Uncertainty Principle). The important point for this discussion is that it is quite standard to derive non-orthogonal vectors for states in this fashion, not only in generalized quantum structures but in quantum mechanics itself. The underlying spherical harmonic functions involved in angular momentum are orthogonal to one another under pairwise integration, but lead to several possible non-orthogonal angular momentum vectors.
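A two-dimensional sketch of the generalized recipe, following the right-hand strategy of Fig. 2; the particular starting point and direction are illustrative assumptions, chosen so the geometry is easy to inspect.

```python
import numpy as np

def demarcators_along_line(start, direction, n, normalize=True):
    # Place n demarcators at even intervals along a line, then (optionally)
    # project them onto the unit sphere by normalizing.
    points = [start + t * direction for t in np.linspace(0.0, 1.0, n)]
    if normalize:
        points = [p / np.linalg.norm(p) for p in points]
    return points

# A generating line parallel to the vertical axis, offset by one unit,
# marked at even intervals and normalized.
D = demarcators_along_line(np.array([1.0, 0.0]), np.array([0.0, 2.0]), 5)

print(round(D[0] @ D[1], 3))  # positions 1 and 2
print(round(D[2] @ D[3], 3))  # positions 3 and 4: closer together after normalization
```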

The ambiguity of the projection onto the unit sphere in the angular momentum model in more than two dimensions is problematic, and we expect the strategy on the right to be simpler in practice. Note also that neither strategy distributes vectors evenly around the unit sphere: for example, in the right-hand strategy, the vectors representing positions 3 and 4 are closer to each other than the vectors representing positions 1 and 2. Such flexibility to vary the pairwise similarities between positions along a string is a desirable property, because changes at the beginning of a word may be more significant than changes in the middle [13]. We note also that in our current implementation in the Semantic Vectors package, this generalized strategy works well for real-valued and complex-valued vectors, but is as yet underspecified for binary-valued vectors because (for example) ‘the point half-way between \(A\) and \(B\)’ is multiply defined using Hamming distance [22]. We are currently investigating appropriate strategies to bring binary vectors into this generalized description.

5 Applications: Orthography, Morphology, Sentence Similarity

Table 2 provides examples of nearest-neighbor search based on orthographic similarity in real, complex and binary vector spaces derived from the widely-used Touchstone Applied Science Associates, Inc. (TASA) corpus. Terms in the corpus that occurred between 5 and 15,000 times were included as candidates for retrieval. The dimensionality of the real, complex and binary vector spaces was 1,000, 500 and 32,000 respectively. Our approach successfully recovers orthographically related terms, including terms containing substrings of the original term (“dominic” vs. “condominium”); insertions (“orthography” vs. “orthophotography”); substitutions (“angular” vs. “annular”); and transpositions of characters (“wahle” vs. “whale”). While not shown in the table on account of space constraints, the three models produced similar sets of results for the same cue term.

Table 2. Orthographic similarity
Table 3. Comparison with benchmark conditions from [15]. 1X = original dimensionality (used for Table 2). 2X = twice original dimensionality. nb = no binding. \(\mathrm{ED} =1 - \frac{\mathrm {edit\ distance}}{\mathrm {combined\ length}}.\)

While we would be hesitant to propose the simple model of orthographic representation we have developed as a cognitive model of lexical coding, it is interesting to note that it does conform to the majority of a set of constraints abstracted from lexical priming data by Hannagan and his colleagues [15]. We will describe these constraints in brief here, but refer the interested reader to [15] for further details. The constraints are as follows: (1) Stability: a string should be most similar to itself (\({\mathrm {sim}} \ge 0.95\)); (2) Edge Effects: substitutions at the edges of strings should be more disruptive; (3) Local Translocations (TL): transposing adjacent characters should be less disruptive than substituting both of them; (4) Global TL: transposing all adjacent characters should be maximally disruptive; (5) Distal TL: transposing non-adjacent characters should be more disruptive than substituting one, but less than substituting both; (6) Compound TL: TL and substitution combined should be more disruptive than substitution alone; (7) Distinct Relative Position (RP): removing some characters should preserve some similarity; and (8) Repeated RP: removing a repeated or a non-repeated letter should be equally disruptive (Footnote 4). Each constraint is accompanied by a set of test cases, consisting of paired strings, and the degree to which a model meets the constraint is determined from the estimated similarities between these pairs, and the relationships between them.
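To make the evaluation procedure concrete, the following sketch applies the simplified real-vector model from the earlier sketches to two of these constraints. The test strings imitate the style of the benchmark in [15] but are not the benchmark items themselves.

```python
import numpy as np

rng = np.random.default_rng(17)
DIM, N_POS = 1000, 8

d_alpha = rng.standard_normal(DIM)
d_alpha /= np.linalg.norm(d_alpha)
d_omega = rng.standard_normal(DIM)
d_omega -= (d_omega @ d_alpha) * d_alpha
d_omega /= np.linalg.norm(d_omega)
D = [np.cos(t) * d_alpha + np.sin(t) * d_omega
     for t in np.linspace(0.0, np.pi / 2, N_POS)]

def ortho(word: str) -> np.ndarray:
    s = sum(np.roll(D[i], ord(c)) for i, c in enumerate(word))
    return s / np.linalg.norm(s)

def sim(a: str, b: str) -> float:
    return float(ortho(a) @ ortho(b))

base = "12345678"
# Stability: a string should be most similar to itself.
print(sim(base, base) >= 0.95)                        # True
# Local TL: transposing adjacent characters should be less disruptive
# than substituting both of them.
print(sim(base, "12435678") > sim(base, "12dd5678"))  # True
```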

The extent to which our models meet these constraints is shown in Table 3. Estimates based on all models consistently meet all constraints aside from those related to Edge Effects and the Global and Distal TL constraints (in the latter case because translocating characters one position apart is less disruptive than substituting one of them). This represents a better fit to these constraints than comparable models based on letter distribution only (labeled “nb”, i.e. no binding). It also represents a better fit than the majority of approaches evaluated against this benchmark previously [15, 23], providing motivation for the further evaluation of a more developed model in the future. While the real model appeared to meet the edge-effect-related constraint at its original dimensionality, this finding did not hold at higher dimensionality, and was most likely produced by random overlap. This is not surprising given that our current model does not address edge effects. As suggested in Sect. 4, one way to address this issue would be to increase the distance between peripheral demarcator vectors, a customization we plan to evaluate in future work.

Table 4. Combining orthographic and semantic similarity

The results in Table 4 were obtained by superposing the orthographic vector for each term from the TASA corpus with a semantic vector for the same term generated using the permutation-based approach described in [2], with a 2+2 sliding window. As anticipated by previous work combining orthographic and semantic relatedness [24, 25], the examples suggest that this model is able to find associations between morphologically related terms, including English verb roots and past-tense forms related by non-affixal patterns, such as “bring:brought”. The combination of semantics and shared characters is evident in other examples, such as “think:intend”. However, this sensitivity to morphological similarity comes at the cost of introducing spurious similarity when common letter patterns do not have semantic significance.
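A sketch of the superposition step; equal weighting of the normalized components is an assumption here, as the weighting is not specified above.

```python
import numpy as np

def combined_vector(orthographic: np.ndarray, semantic: np.ndarray) -> np.ndarray:
    # Superpose the two normalized components and renormalize, so that both
    # orthographic and distributional similarity contribute to retrieval.
    o = orthographic / np.linalg.norm(orthographic)
    s = semantic / np.linalg.norm(semantic)
    c = o + s
    return c / np.linalg.norm(c)
```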

Table 5. Retrieval of sentences from the TASA corpus in complex (first two examples) and binary (second two examples) spaces

6 Conclusion

In this paper we have introduced a novel approach through which the near-orthogonality of elemental vectors is deliberately violated to introduce measured similarity into semantic space. While illustrated primarily through orthographic modeling, the approach is general in nature and can be applied in any situation in which a representation of sequence that is tolerant to variation is desired. Furthermore, this approach may enable the generation of holistic representations combining distributional and spatial information, a direction we plan to explore in future work. To facilitate further experimentation, our real, binary and complex orthographic vector implementations have been released as components of the open source Semantic Vectors package [27, 28].