
1 Introduction

The Self-Organizing Map (SOM) algorithm [25, 41] has proven to be an effective technique for unsupervised machine learning and dimensionality reduction of multi-dimensional data. Its applications range from the conventional use in 2-D visualization of multi-dimensional data to more recent developments such as the analysis of energy consumption patterns in urban environments [6, 8], autonomous video surveillance [29], multimodal data fusion [14], incremental change detection [28], learning models from spiking neurons [12], and the identification of social media trends [3, 7]. The latter use case is an example of an entire application domain of SOMs: learning on symbolic data. This type of data is typically present in various tasks of natural language processing.

As the SOM uses weight vectors of fixed dimensionality, this dimensionality must equal that of the input data. A conventional approach for feeding variable-length symbolic data into the SOM is to obtain a fixed-length representation through n-gram statistics (e.g., bigrams when \(n=2\) or trigrams when \(n=3\)). The n-gram statistics, which is a vector spanning all possible combinations of n symbols of the data alphabet, is calculated during a pre-processing routine that populates the vector with the occurrences of each n-gram in the symbolic data. An obvious computational bottleneck of such an approach is the length of the n-gram statistics, which grows exponentially with n. Since the vector is typically sparse, some memory optimization is possible on the data input side. For example, only the indices of non-zero positions can be presented to the SOM. This, however, does not help with the distance calculation, which is the major operation of the SOM. Since the weight vectors are dense, the input vectors must be unrolled to their original dimensionality for computing the distances. In this paper, we present an approach where the SOM uses mappings of n-gram statistics instead of the conventional n-gram statistics. Mappings are vectors of fixed arbitrary dimensionality, where the dimensionality can be substantially lower than the number of all possible n-grams.

Outline of the proposed approach

Fig. 1. Outline of the conventional approach.

The core of the proposed approach is the use of hyperdimensional computing and distributed data representations. Hyperdimensional computing is a bio-inspired computational paradigm in which all computations are done with randomly generated vectors of high dimensionality. Figure 1 outlines the conventional approach of using n-gram statistics with SOMs. First, for the input symbolic data we calculate the n-gram statistics. The size of the vector \(\mathbf s \), which contains the n-gram statistics, is determined by the size of the data’s alphabet a and the chosen n. In the conventional approach, \(\mathbf s \) is then used as an input \(\mathbf x \) to either train or test the SOM (the red vertical line in Fig. 1). The approach proposed in this paper modifies the conventional approach by introducing an additional step, as outlined in Fig. 2. The blocks in green denote the elements of this additional step. For example, the item memory stores the distributed representations of the alphabet. In the proposed approach, before providing \(\mathbf s \) to the SOM, \(\mathbf s \) is mapped to a distributed representation \(\mathbf h \), which is then used as the input to the SOM (the red vertical line in Fig. 2).

The paper is structured as follows. Section 2 describes the related work. Section 3 presents the methods used in this paper. Section 4 reports the results of the experiments. The conclusions follow in Sect. 5.

Fig. 2. Outline of the proposed approach.

2 Related Work

The SOM algorithm [25] was originally designed for metric vector spaces. It develops a non-linear mapping of a high-dimensional input space to a two-dimensional map of nodes using competitive, unsupervised learning. The output of the algorithm, the SOM, represents an ordered topology of complex entities [26], which is then used for visualization, clustering, classification, profiling, or prediction. Multiple variants of the SOM algorithm that overcome structural, functional, and application-focused limitations have been proposed. Among the key developments are the Generative Topographic Mapping based on non-linear latent variable modeling [4], the Growing SOM (GSOM) that addresses the predetermined size constraints [1], the TASOM based on adaptive learning rates and neighborhood sizes [40], the WEBSOM for text analysis [17], and the IKASL algorithm [5] that addresses challenges in incremental unsupervised learning. Moreover, an important recent direction is the simplification of the SOM algorithm [2, 19, 39] to improve its speed and power-efficiency.

However, only a limited body of work has explored the applicability of the SOM beyond its original metric vector space. In contrast to a metric vector space, a symbolic data space is a non-vectorial representation that possesses an internal variation and structure which must be taken into account in computations. Records in a symbolic dataset are not limited to a single value; for instance, each data point can be a hypercube in p-dimensional space or a Cartesian product of distributions. In [26], the authors made the first effort to apply the SOM algorithm to symbol strings; the primary challenges were the discrete nature of the data points and the adjustments required for the learning rule, which were addressed using generalized means/medians and the batch map principle. Research reported in [42] takes a more direct approach to n-gram modeling of HTTP requests from network logs. Feature matrices are formed by counting the occurrences of n-character sequences corresponding to each array in the HTTP request, generating a memory-intensive feature vector of length \(256^n\). The feature matrices are fed into a variant of the SOM, Growing Hierarchical SOMs [9], to detect anomalous requests. The authors report both accuracy and precision of 99.9% on average when using bigrams and trigrams. Given the limited awareness and availability of research into unsupervised machine learning on symbolic data, coupled with the increasing complexity of raw data [27], it is pertinent to investigate the functional synergies between hyperdimensional computing and the principles of SOMs.

Fig. 3. Illustration of a self-organizing map with nine nodes organized according to the grid topology.

3 Methods

This section presents the methods used in this paper. We describe: the basics of the SOM algorithm; the process of collecting n-gram statistics; the basics of hyperdimensional computing; and the mapping of n-gram statistics to the distributed representation using hyperdimensional computing.

3.1 Self-organizing Maps

A SOM [25] (see Fig. 3) consists of a set of nodes arranged in a certain topology (e.g., a rectangular or a hexagonal grid or even a straight line). Each node j is characterized by a weight vector of dimensionality equal to that of an input vector (denoted as \(\mathbf x \)). The weight vectors are typically initialized at random. Denote the \(u \times k\) matrix of k-dimensional weight vectors of the u nodes in a SOM as \(\mathbf W \). Also denote the weight vector of node j as \(\mathbf W _j\) and the ith position of this vector as \(\mathbf W _{ji}\). One of the main steps in the SOM algorithm is, for a given input vector \(\mathbf x \), to identify the winning node, i.e., the node whose weight vector is closest to \(\mathbf x \). The computation of the distances between the input \(\mathbf x \) and the weight vectors in \(\mathbf W \), the winner-takes-all procedure, and the weight update rule are the main components of the SOM logic. They are outlined in the text below.

In order to compare \(\mathbf x \) and \(\mathbf W _j\), a similarity measure is needed. The SOM uses the Euclidean distance:

$$\begin{aligned} D(\mathbf x ,\mathbf W _j)=\sqrt{\sum _{i=1}^{k} (\mathbf x _i-\mathbf W _{ji})^2 }, \end{aligned}$$
(1)

where \(\mathbf x _i\) and \(\mathbf W _{ji}\) are the corresponding values of the ith positions. The winning node (denoted as w) is defined as the node with the lowest Euclidean distance to the input \(\mathbf x \).

In the SOM, a neighborhood \(\mathcal {M}\) of nodes around the winning node w is selected and updated; the size of the neighborhood progressively decreases:

$$\begin{aligned} \gamma (j,w,t) = e^{-l(j,w)/(2\sigma (t)^2) }, \end{aligned}$$
(2)

where \(l(j,w)\) is the lateral distance between a node j and the winning node w on the SOM’s topology; \(\sigma (t)\) is a decreasing function, which depends on the current training iteration t. If a node j is within the neighborhood \(\mathcal {M}\) of w, then the weight vector \(\mathbf W _{j}\) is updated with:

$$\begin{aligned} \Delta \mathbf W _{j}=\eta (t) \gamma (j,w,t) (\mathbf x -\mathbf W _{j}), \end{aligned}$$
(3)

where \(\eta (t)\) denotes the learning rate, which decreases with increasing t. During an iteration t, the weights are updated for all available training inputs \(\mathbf x \). The training process usually runs for T iterations.

Once the SOM has been trained, it can be used in the operating phase. The operating phase is very similar to the training phase except that the weights stored in \(\mathbf W \) are kept fixed. For a given input \(\mathbf x \), the SOM identifies the winning node w. This information is used depending on the task at hand. For example, in clustering tasks, a node could be associated with a certain region. In this paper, we consider the classification task, and therefore, each node has an assigned classification label.
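To make the above steps concrete, the following minimal NumPy sketch combines Eqs. (1)-(3) into one training loop: it computes the Euclidean distances, selects the winner, and updates the weights with a Gaussian neighborhood over the squared lateral grid distances (all nodes are updated with weights that decay with distance, rather than selecting an explicit neighborhood \(\mathcal {M}\)). The initialization, decay schedules, and function names are illustrative assumptions, not the exact implementation used in the experiments.

```python
# A minimal sketch of SOM training (Eqs. (1)-(3)); schedules are assumptions.
import numpy as np

def train_som(X, grid_side=10, T=100, eta0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    u, k = grid_side * grid_side, X.shape[1]
    W = rng.standard_normal((u, k)) * 0.01            # random weight initialization
    # 2-D coordinates of the nodes on the grid, used for lateral distances l(j, w)
    coords = np.array([(r, c) for r in range(grid_side) for c in range(grid_side)])
    for t in range(T):
        eta = eta0 * np.exp(-t / T)                   # decreasing learning rate
        sigma = sigma0 * np.exp(-t / T)               # shrinking neighborhood width
        for x in X:
            d = np.linalg.norm(W - x, axis=1)         # Eq. (1): Euclidean distances
            w = np.argmin(d)                          # winner-takes-all
            l2 = np.sum((coords - coords[w]) ** 2, axis=1)
            gamma = np.exp(-l2 / (2 * sigma ** 2))    # Eq. (2): neighborhood function
            W += eta * gamma[:, None] * (x - W)       # Eq. (3): weight update
    return W
```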

3.2 n-gram Statistics

In order to calculate the n-gram statistics for the input symbolic data \(\mathcal {D}\), which is described by an alphabet of size a, we first initialize an empty vector \(\mathbf s \). This vector stores the n-gram statistics for \(\mathcal {D}\), where the ith position in \(\mathbf s \) corresponds to an n-gram \(\varvec{\mathcal {N}}_i=\langle \mathcal {S}_1, \mathcal {S}_2, \dots , \mathcal {S}_n \rangle \) from the set \(\varvec{\mathcal {N}}\) of all unique n-grams; \(\mathcal {S}_j\) corresponds to the symbol in the jth position of \(\varvec{\mathcal {N}}_i\). The value \(\mathbf s _i\) indicates the number of times \(\varvec{\mathcal {N}}_i\) was observed in the input symbolic data \(\mathcal {D}\). The dimensionality of \(\mathbf s \) is equal to the total number of n-grams in \(\varvec{\mathcal {N}}\), which in turn depends on a and n (the size of the n-grams) and is calculated as \(a^n\) (i.e., \(\mathbf s \in [a^n \times 1]\)). The n-gram statistics \(\mathbf s \) is calculated via a single pass through \(\mathcal {D}\) using an overlapping sliding window of size n, where for the n-gram observed in the current window the value of its corresponding position in \(\mathbf s \) (i.e., its counter) is incremented by one. Thus, \(\mathbf s \) characterizes how many times each n-gram in \(\varvec{\mathcal {N}}\) was observed in \(\mathcal {D}\).
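As an illustration of this procedure, the short Python sketch below counts the n-gram statistics of a string with an overlapping sliding window; the alphabet, function name, and example string are assumptions made for the example.

```python
# A sketch of n-gram statistics: one pass with an overlapping sliding window.
from itertools import product
import numpy as np

def ngram_statistics(text, alphabet, n=2):
    # enumerate the a**n possible n-grams; position i of s counts n-gram N_i
    index = {ng: i for i, ng in enumerate(product(alphabet, repeat=n))}
    s = np.zeros(len(alphabet) ** n)
    for j in range(len(text) - n + 1):                # overlapping sliding window
        s[index[tuple(text[j:j + n])]] += 1           # increment the counter of N_i
    return s

alphabet = "abcdefghijklmnopqrstuvwxyz "              # 27 symbols, as in Sect. 4
s = ngram_statistics("hello world", alphabet, n=2)    # s has 27**2 = 729 positions
```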

3.3 Hyperdimensional Computing

Hyperdimensional computing [16, 31, 33, 34], also known as Vector Symbolic Architectures, is a family of bio-inspired methods for representing and manipulating concepts and their meanings in a high-dimensional space. Hyperdimensional computing finds its applications in, for example, cognitive architectures [10], natural language processing [20, 38], biomedical signal processing [22, 35], approximation of conventional data structures [23, 30], and classification tasks [18] such as gesture recognition [24], physical activity recognition [37], and fault isolation [21]. Vectors of high (but fixed) dimensionality (denoted as d) are the basis for representing information in hyperdimensional computing. These vectors are often referred to as high-dimensional vectors or HD vectors. The information is distributed across an HD vector’s positions; therefore, HD vectors use distributed representations. Distributed representations [13] are contrary to the localist representations (which are used in the conventional n-gram statistics) since individual positions or subsets of positions cannot be interpreted on their own. In other words, a particular position of an HD vector does not have any interpretable meaning – only the whole HD vector can be interpreted as a holistic representation of some entity, which in turn bears some information load. In the scope of this paper, symbols of the alphabet are the most basic components of the system and their atomic HD vectors are generated randomly. Atomic HD vectors are stored in the so-called item memory, which in its simplest form is a matrix. Denote the item memory as \(\mathbf H \), where \(\mathbf H \in [d \times a]\). For a given symbol \(\mathcal {S}\) its corresponding HD vector from \(\mathbf H \) is denoted as \(\mathbf H _{\mathcal {S}}\). Atomic HD vectors in \(\mathbf H \) are bipolar (\(\mathbf H _{\mathcal {S}} \in \{-1, +1\}^{[d \times 1]}\)) and random with equal probabilities for \(+1\) and \(-1\). It is worth noting that an important property of high-dimensional spaces is that, with an extremely high probability, all random HD vectors are dissimilar to each other (quasi-orthogonal).
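The following small sketch illustrates such an item memory: it draws random bipolar HD vectors and checks their quasi-orthogonality via the cosine similarity. The dimensionality and variable names are illustrative assumptions.

```python
# A sketch of a random bipolar item memory H (d x a) and quasi-orthogonality.
import numpy as np

rng = np.random.default_rng(1)
d, a = 1000, 27
H = rng.choice([-1, +1], size=(d, a))                 # one column per alphabet symbol

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# random bipolar HD vectors are quasi-orthogonal: their similarity is close to 0
print(cosine(H[:, 0], H[:, 1]))
```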

In order to manipulate atomic HD vectors, hyperdimensional computing defines operations and a similarity measure on HD vectors. In this paper, we use the cosine similarity for characterizing similarity. Three key operations for computing with HD vectors are bundling, binding, and permutation.

The binding operation is used to bind two HD vectors together. The result of binding is another HD vector. For example, for two symbols \(\mathcal {S}_1\) and \(\mathcal {S}_2\), the result of binding their HD vectors (denoted as \(\mathbf b \)) is calculated as follows:

$$\begin{aligned} \mathbf b = \mathbf H _{\mathcal {S}_1} \odot \mathbf H _{\mathcal {S}_2}, \end{aligned}$$
(4)

where the notation \(\odot \) for the Hadamard product is used to denote the binding operation since this paper uses positionwise multiplication for binding. An important property of the binding operation is that the resultant HD vector \(\mathbf b \) is dissimilar to the HD vectors being bound, i.e., the cosine similarity between \(\mathbf b \) and \(\mathbf H _{\mathcal {S}_1}\) or \(\mathbf H _{\mathcal {S}_2}\) is approximately 0.

An alternative approach to binding, used when there is only one HD vector, is to permute (rotate) the positions of the HD vector. It is convenient to use a fixed permutation (denoted as \(\rho \)) to bind a position of a symbol in a sequence to the HD vector representing the symbol in that position. Thus, for a symbol \(\mathcal {S}_1\) the result of permuting its HD vector (denoted as \(\mathbf r \)) is calculated as follows:

$$\begin{aligned} \mathbf r = \rho ( \mathbf H _{\mathcal {S}_1}). \end{aligned}$$
(5)

Similar to the binding operation, the resultant HD vector \(\mathbf r \) is dissimilar to \(\mathbf H _{\mathcal {S}_1}\).

The last operation is called bundling. It is denoted with \(+\) and implemented via positionwise addition. The bundling operation combines several HD vectors into a single HD vector. For example, for \(\mathcal {S}_1\) and \(\mathcal {S}_2\) the result of bundling their HD vectors (denoted as \(\mathbf a \)) is simply:

$$\begin{aligned} \mathbf a = \mathbf H _{\mathcal {S}_1} + \mathbf H _{\mathcal {S}_2}. \end{aligned}$$
(6)

In contrast to the binding and permutation operations, the resultant HD vector \(\mathbf a \) is similar to all bundled HD vectors, i.e., the cosine similarity between \(\mathbf a \) and \(\mathbf H _{\mathcal {S}_1}\) or \(\mathbf H _{\mathcal {S}_2}\) is larger than 0. Thus, the bundling operation allows storing information in HD vectors [11]. Moreover, if several copies of an HD vector are included (e.g., \(\mathbf a = 3\mathbf H _{\mathcal {S}_1} + \mathbf H _{\mathcal {S}_2}\)), the resultant HD vector is more similar to the dominating HD vector than to the other components.
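The sketch below illustrates the three operations of Eqs. (4)-(6) on random bipolar HD vectors and verifies the similarity properties stated above; the dimensionality and the choice of a cyclic shift for the permutation are assumptions of the example.

```python
# A sketch of binding (Hadamard product), permutation (cyclic shift), and
# bundling (positionwise addition) on bipolar HD vectors, Eqs. (4)-(6).
import numpy as np

rng = np.random.default_rng(2)
d = 1000
h1, h2 = rng.choice([-1, +1], size=(2, d))            # HD vectors of S_1 and S_2

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

b = h1 * h2                                           # Eq. (4): binding
r = np.roll(h1, 1)                                    # Eq. (5): fixed permutation rho
a_vec = h1 + h2                                       # Eq. (6): bundling

print(cosine(b, h1), cosine(r, h1))                   # both approximately 0 (dissimilar)
print(cosine(a_vec, h1), cosine(a_vec, h2))           # both clearly above 0 (similar)
print(cosine(3 * h1 + h2, h1) > cosine(3 * h1 + h2, h2))  # True: dominating component
```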

3.4 Mapping of n-gram Statistics with Hyperdimensional Computing

The mapping of n-gram statistics into a distributed representation using hyperdimensional computing was first shown in [15]. At the initialization phase, the random item memory \(\mathbf H \) is generated for the alphabet. The position of symbol \(\mathcal {S}_j\) in \(\varvec{\mathcal {N}}_i\) is represented by applying the fixed permutation \(\rho \) to the corresponding atomic HD vector \(\mathbf H _{\mathcal {S}_j}\) j times, which is denoted as \(\rho ^{j}(\mathbf H _{\mathcal {S}_j})\). Next, a single HD vector for \(\varvec{\mathcal {N}}_i\) (denoted as \(\mathbf m _{\varvec{\mathcal {N}}_i}\)) is formed via the consecutive binding of the permuted HD vectors \(\rho ^{j}(\mathbf H _{\mathcal {S}_j})\) representing the symbols in each position j of \(\varvec{\mathcal {N}}_i\). For example, the trigram ‘cba’ is mapped to its HD vector as follows: \( \rho ^{1}(\mathbf H _{\text {c}}) \odot \rho ^{2}(\mathbf H _{\text {b}}) \odot \rho ^{3}(\mathbf H _{\text {a}}) \). In general, the process of forming the HD vector of an n-gram can be formalized as follows:

$$\begin{aligned} \mathbf m _{\varvec{\mathcal {N}}_i} = \prod _{j=1}^{n} \rho ^{j}(\mathbf H _{\mathcal {S}_j}), \end{aligned}$$
(7)

where \(\prod \) denotes the binding operation (positionwise multiplication) when applied to n HD vectors.

Once it is known how to map a particular n-gram to an HD vector, mapping the whole n-gram statistics \(\mathbf s \) is straightforward. The HD vector \(\mathbf h \) corresponding to \(\mathbf s \) is created by bundling together the HD vectors of all n-grams observed in the data, which is expressed as follows:

$$\begin{aligned} \mathbf h =\sum _{i=1}^{a^n} \mathbf s _i \mathbf m _{\varvec{\mathcal {N}}_i} = \sum _{i=1}^{a^n} \mathbf s _i \prod _{j=1}^{n} \rho ^{j}(\mathbf H _{\mathcal {S}_j}), \end{aligned}$$
(8)

where \(\sum \) denotes the bundling operation when applied to several HD vectors. Note that \(\mathbf h \) is not bipolar; therefore, in the experiments below we normalized it by its \(\ell _2\) norm.
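A minimal sketch of this mapping, assuming a cyclic shift for the permutation \(\rho \) and a dictionary-based item memory, is given below; it implements Eqs. (7) and (8) and normalizes the resulting HD vector by its \(\ell _2\) norm. The names and dimensionalities are illustrative.

```python
# A sketch of Eqs. (7)-(8): bind permuted atomic HD vectors per n-gram, then
# bundle the HD vectors of all n-grams observed in the data into h.
import numpy as np

rng = np.random.default_rng(3)
alphabet = "abcdefghijklmnopqrstuvwxyz "
d, n = 1000, 2
H = {sym: rng.choice([-1, +1], size=d) for sym in alphabet}   # item memory

def ngram_to_hd(ngram):
    # Eq. (7): bind rho^j(H_Sj) over positions j = 1..n of the n-gram
    m = np.ones(d)
    for j, sym in enumerate(ngram, start=1):
        m *= np.roll(H[sym], j)                               # rho^j as a cyclic shift
    return m

def map_text(text):
    # Eq. (8): bundle the HD vectors of all n-grams observed in the data
    h = np.zeros(d)
    for j in range(len(text) - n + 1):
        h += ngram_to_hd(text[j:j + n])
    return h / np.linalg.norm(h)                              # l2 normalization

x = map_text("hello world")                                   # d-dimensional SOM input
```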

4 Experimental Results

This section describes the experimental results studying several configurations of the proposed approach and comparing them with the results obtained for the conventional n-gram statistics. We slightly modified the experimental setup from that used in [15], where the task was to identify the language of a given text sample (i.e., of a string of symbols). The language recognition was done for 21 European languages. The list of languages is as follows: Bulgarian, Czech, Danish, German, Greek, English, Estonian, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish. The training data is based on the Wortschatz Corpora [32]. The average size of a language’s corpus in the training data was \(1,085,637.3 \pm 121,904.1\) symbols. It is worth noting that in the experiments reported in [15] the whole training corpus of a particular language was used to estimate the corresponding n-gram statistics, while in this study, in order to enable the training of SOMs, each language corpus was divided into samples of 1,000 symbols each. The total number of samples in the training data was 22,791. The test data is based on the Europarl Parallel Corpus. The test data also represents the 21 European languages. The total number of samples in the test data was 21,000, where each language was represented by 1,000 samples. Each sample in the test data corresponds to a single sentence. The average size of a sample in the test data was \(150.3 \pm 89.5\) symbols.

The data for each language was preprocessed so that the text included only lower-case letters and spaces. All punctuation was removed. Lastly, all text used the 26-letter ISO basic Latin alphabet plus the space symbol, i.e., the alphabet for both training and test data was the same and included 27 symbols. For each text sample the n-gram statistics (either conventional or mapped to the distributed representation) was obtained, which was then used as the input \(\mathbf x \) when training or testing SOMs. Since each sample was preprocessed to use the alphabet of only \(a=27\) symbols, the conventional n-gram statistics input is \(27^{n}\)-dimensional (e.g., \(k=729\) when \(n=2\)) while the dimensionality of the mapped n-gram statistics depends on the dimensionality of HD vectors d (i.e., \(k=d\)). In all experiments reported in this paper, we used the standard SOM implementation, which is a part of the Deep Learning Toolbox in MATLAB R2018b (MathWorks Inc., Natick, MA).
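A possible preprocessing step along these lines is sketched below; the exact cleaning rules of the original setup are not specified beyond the description above, so the regular expressions are assumptions.

```python
# A sketch of the preprocessing: lower-casing and restricting the text to the
# 26 Latin letters plus space; cleaning rules are assumptions of this example.
import re

def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^a-z ]+", " ", text)        # drop punctuation, digits, other symbols
    return re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace

print(preprocess("Hello, World! 123"))           # -> "hello world"
```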

During the experiments, certain parameters of the SOM were fixed. In particular, the topology of the SOMs was set to the standard grid topology. The initial size of the neighborhood was always fixed to ten. The size of the neighborhood and the learning rate decreased progressively with training according to the default rules of the used implementation. In all simulations, a SOM was trained for a given number of iterations T, which was set according to the experiment reported in Fig. 4. All reported results were averaged across five independent simulations. The bars in the figures show standard deviations.

Fig. 4. The classification accuracy of the SOM trained on the conventional bigram statistics (\(n=2\); \(k=729\)) against the number of training iterations T. The grid size was set to ten (\(u=100\)). T varied in the range [5, 100] with step 5.

Recall that SOMs are suited for unsupervised training; therefore, an extra mechanism is needed to use them in supervised tasks such as the considered language recognition task, i.e., once the SOM is trained, there is still a need to assign a label to each trained node. After training a SOM for T iterations using all 22,791 training samples, the whole training data was presented to the trained SOM one more time without modifying \(\mathbf W \). The labels of the training data were used to collect statistics for the winning nodes. Each node was assigned the label of the language dominating in the collected statistics. If a node in the trained SOM was never chosen as the winning node for the training samples (i.e., its statistics were empty), then this node was ignored during the testing phase. During the testing phase, the 21,000 samples of the test data were used to assess the trained SOM. For each sample in the test data, the winning node was determined. The test sample was then assigned the language label corresponding to its winning node. The classification accuracy was calculated using the SOM predictions and the ground truth of the test data. The accuracy was used as the main performance metric for the evaluation and comparison of different SOMs. It is worth emphasizing that the focus of the experiments is not on achieving the highest possible accuracy but on a comparative analysis of SOMs with the conventional n-gram statistics versus SOMs with the mapped n-gram statistics with varying d. However, it is worth noting that the accuracy obtained when collecting an n-gram statistics profile for each language [15, 36] for \(n=2\) and \(n=3\) and using the nearest neighbor classifier was 0.945 and 0.977, respectively. Thus, the results presented below for SOMs match the ones obtained with supervised learning on bigrams when the number of nodes is sufficiently high. In the case of trigrams, the highest accuracy obtained with SOMs was slightly (about 0.02) lower. While SOMs do not necessarily achieve the highest accuracy compared to supervised methods, their important advantage is data visualization. For example, in the considered task one could imagine using the trained SOM for identifying the clusters typical for each language and even reflecting on their relative locations on the map.
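The node-labeling and testing procedure described above can be sketched as follows; the function names and the majority-voting details are illustrative assumptions.

```python
# A sketch of the labeling step: pass the training data through the trained SOM
# once more, collect winning-node statistics, and assign each node the label of
# the dominating language; unlabeled nodes are ignored during testing.
import numpy as np
from collections import Counter, defaultdict

def label_nodes(W, X_train, y_train):
    votes = defaultdict(Counter)
    for x, y in zip(X_train, y_train):
        w = np.argmin(np.linalg.norm(W - x, axis=1))  # winning node for this sample
        votes[w][y] += 1
    # nodes never chosen as winners receive no label
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}

def classify(W, labels, x):
    labeled = np.array(sorted(labels))                # restrict search to labeled nodes
    w = labeled[np.argmin(np.linalg.norm(W[labeled] - x, axis=1))]
    return labels[w]
```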

Fig. 5. The classification accuracy of the SOM against the grid size for the case of bigram statistics. The grid size varied in the range [2, 20] with step 2.

Fig. 6. The training time of the SOM against the grid size for the case of bigram statistics. The grid size varied in the range [2, 20] with step 2.

The experiment in Fig. 4 presents the classification accuracy of the SOM trained on the conventional bigram statistics against T. The results demonstrated that the accuracy increased with the number of training iterations T. Moreover, for higher values of T the predictions were more stable. The performance started to saturate for T greater than 90; therefore, in the other experiments the value of T was fixed to 100.

The grid size varied in the range [2, 20] with step 2, i.e., the number of nodes u varied between 4 and 400. In Fig. 5 the solid curve corresponds to the SOM trained on the conventional bigram statistics. The dashed, dash-dot, and dotted curves correspond to the SOMs trained on the mapped bigram statistics with \(k=d=500\), \(k=d=300\), and \(k=d=100\), respectively.

The experiment presented in Fig. 5 studied the classification accuracy of the SOM against the grid size for the case of bigram statistics. Note that the number of nodes u in the SOM equals the square of the grid size. For example, when the grid size equals 2 the SOM has \(u=4\) nodes, while when it equals 20 the SOM has \(u=400\) nodes. The results in Fig. 5 demonstrated that the accuracy of all considered SOMs improves with the increased grid size. It is intuitive that all SOMs with grid sizes less than five performed poorly since the number of nodes in these SOMs was lower than the number of different languages in the task. Nevertheless, the performance of all SOMs was constantly improving with the increased grid size, but the accuracy started to saturate at about 100 nodes. Moreover, increasing the dimensionality of HD vectors d improved the accuracy. Note, however, that the improvement when going from \(d=100\) to \(d=300\) was larger than that when increasing d from 300 to 500. The performance of the conventional bigram statistics was already approximated well even when \(d=300\); for \(d=500\) the accuracy was just slightly worse than that of the conventional bigram statistics.

Fig. 7. The classification accuracy of the SOM trained on the mapped bigram statistics (\(n=2\)) against the dimensionality of HD vectors d (\(k=d\)). The grid size was set to 16 (\(u=256\)). The number of training iterations T was fixed to 100.

It is important to mention that the usage of the mapped n-gram statistics decreases the size of \(\mathbf W \) in proportion to \(d/a^n\). Moreover, it decreases the training time of SOMs. The experiment in Fig. 6 presents the training time of the SOM against the grid size for the case of bigram statistics. The setup in Fig. 6 corresponds to that of Fig. 5. The number of training iterations was fixed to \(T=100\). For example, for grid size 16 the average training time on a laptop for \(k=d=100\) was 2.7 min (accuracy 0.86); for \(k=d=300\) it was 8.0 min (accuracy 0.91); for \(k=d=500\) it was 16.9 min (accuracy 0.92); and for \(k=a^n=729\) it was 27.3 min (accuracy 0.93). Thus, the usage of the mapping allows a trade-off between the obtained accuracy and the required computational resources.

In order to observe a more detailed dependency between the classification accuracy and the dimensionality d of the distributed representations of the mapped n-gram statistics, an additional experiment was done. Figure 7 depicts the results. The dimensionality of the distributed representations d varied in the range [20, 1000] with step 20. It is worth mentioning that even for small dimensionalities (\(d<100\)), the accuracy is far above chance. The results in Fig. 7 are consistent with the observations in Fig. 5 in that the accuracy increased with increasing d. The performance saturation begins for values above 200, and the improvements beyond \(d=500\) look marginal. Thus, we experimentally observed that the quality of the mappings grows with d; however, after a certain saturation point increasing d further becomes impractical.

Fig. 8. The classification accuracy of the SOM against the grid size for the case of trigram statistics (\(n=3\)). The number of training iterations T was fixed to 100.

The last experiment, in Fig. 8, is similar to Fig. 5 but studied the classification accuracy for the case of trigram statistics (\(n=3\)). The grid size varied in the range [2, 20] with step 2. The solid curve corresponds to the SOM trained on the conventional trigram statistics (\(k=27^3=19,683\)). The dashed and dash-dot curves correspond to the SOMs trained on the mapped trigram statistics with \(k=d=5,000\) and \(k=d=1,000\), respectively. The results in Fig. 8 are consistent with the case of bigrams. The classification accuracy of the SOMs was better for higher d, and even when \(d<a^n\) the accuracy of the conventional trigram statistics was approximated well.

5 Conclusions

This paper presented an approach for mapping n-gram statistics into vectors of fixed arbitrary dimensionality, which does not depend on the n-gram size n. The mapping is aided by hyperdimensional computing, a bio-inspired approach for computing with large random vectors. The n-gram statistics mapped in this way are used as the input to Self-Organizing Maps. This step, which is novel for Self-Organizing Maps, removes the computational bottleneck caused by the dimensionality of n-gram statistics growing exponentially with n. While preserving the performance of the trained Self-Organizing Maps (as demonstrated in the language recognition task), the presented approach results in reduced memory consumption due to a smaller weight matrix (proportional to d and u) and shorter training times. The main limitation of this study is that we have validated the proposed approach only on a single task using the conventional Self-Organizing Maps. However, it is worth noting that the proposed approach could easily be used with other modifications of the conventional Self-Organizing Maps such as Growing Self-Organizing Maps [1], where dynamic topology preservation facilitates unconstrained learning. This is in contrast to a fixed-structure feature map, as the map itself is defined by the unsupervised learning process of the feature vectors. We intend to investigate the distributed representation of n-gram statistics in structure-adapting feature maps in future work.