
1 Introduction

The performance of machine and deep learning models in NLP relies heavily on the representation of natural language text. Succeeding with such models therefore requires efficient preprocessing pipelines that produce effective data representations. Firstly, the data representation influences the accuracy of the classifier by determining how much useful information it can extract from raw data. Secondly, dense and high-dimensional representation models can be costly to compute. Indeed, recent advances in deep neural networks (DNNs) have brought forward both the accuracy benefits and the complexity of NLP data representation.

Since natural language data is unstructured, encompassing multiple granularities, tasks, and domains, achieving sufficient natural language understanding remains challenging. At the same time, state-of-the-art language models like BERT and GPT-3 struggle with high computational complexity and lack of explainability [2, 35]. One might argue that attention is an explanation. However, attention merely highlights which part of the input the model used to produce its output. It does not break down the focus area into semantically meaningful units and cannot explain the ensuing reasoning process leading to an output decision [36]. Further, computation-wise, the complexity of attention is quadratic in the input sequence length.

DNN NLP models usually represent words in vector space. Word2Vec is an early and widely used vector-based representation approach introduced by Mikolov et al. in 2013 [29]. In Word2Vec, a single-layer neural network learns the context of words and relates words based on the inner product of context vectors. Similarly, GloVe is a popular unsupervised model incorporating corpus-wide word co-occurrence statistics [31]. The cornerstone of both approaches is the distributional hypothesis, which states that words with similar contexts have similar meanings. While co-locating similar words in vector space boosts generalization, the resulting dense vectors are expensive to compute and difficult to interpret.

The Tsetlin Machine (TM) is a logic-based pattern recognition approach that blends summation-based (cf. logistic regression) and rule-based approaches (cf. decision trees). Recent studies on TMs report promising performance in NLP, including sentiment analysis [44], novelty detection [6, 9], fake news detection [8], semantic relation analysis [34], and robustness toward counterfactual data [46].

The TM leverages propositional logic for interpretable modeling and bitwise operations for efficiency. Recent TM research reports increasingly competitive NLP accuracy compared to deep learning, at reduced complexity and increased efficiency. These properties stem from simple AND rules, referred to as clauses, which employ set-of-words (SOW) features. The clauses are self-contained and hence parallelizable [1]. At the same time, they can capture discriminative patterns that are interpretable [10].

Contributions: In this paper, we propose a representation learning framework for NLP classification utilizing the TM. We use the TM clauses for supervised pretraining, building an abstract logical representation of the training data. We then show, on six NLP classification tasks, that the logical representation can already be effective after three epochs of training. We also evaluate our logic-based approach on twelve domain adaptation tasks from the Amazon dataset. Furthermore, as the learning of the TM is human-interpretable, we provide a case study exploring the explainability of our representation.

2 Related Work

Conventional representation learning mostly focuses on feature engineering for data representation. For example, [23] introduced distributed representations for symbolic data, further developed in the context of statistical language modelling [4] and neural net language models [3]. The neural language models learn a distributed representation for each word, termed a word embedding. One of the most common techniques in NLP is the bag-of-words (BOW) representation [49], extended to n-grams, topic modelling [42], and fuzzy BOW [50]. Other techniques include representing text as graphs [28]. However, because these models lack pre-trained knowledge, the representations produced are in general not robust and consequently yield degraded performance [47].

In recent years, there has been tremendous progress in NLP models employing pretrained language models [19, 24, 33]. Most state-of-the-art NLP solutions are today initialized using various pre-trained input data representations such as word2vec [29], GloVe [31], and FastText [12]. These word embeddings map words into informative low-dimensional vectors, which aid neural networks in processing and understanding language. While initializing the input with such embeddings has demonstrated improved performance in NLP tasks, adopting these sophisticated pretrained language models for data representation comes at a cost. First, the models are intrinsically complicated, being trained on immense amounts of data by fine-tuning a very large number of parameters [2]. Second, as complexity rises, the interpretability of the input representation becomes more ambiguous [21]. One interpretation of such models is based on the attention mechanism, which assigns weights to input features. However, a more extensive investigation demonstrates that attention weights do not in general provide useful explanations [36].

TMs [22] are a recent rule-based machine learning approach that demonstrates competitive performance with DNNs while providing human-interpretable rules through propositional reasoning [5]. Several studies have demonstrated the interpretability of the TM, with accuracy competitive with deep learning approaches. Example TM applications include regression [17], natural language understanding [6,7,8,9, 34, 44, 45], and speech understanding [26]. Furthermore, [10] analyzed the local and global interpretability of TM clauses, showing how the TM's discrimination capability can be interpreted by inspecting each clause. However, these studies generally employ the TM as a classifier. In this work, we instead exploit the data representations created by a TM to train computationally simple machine learning classifiers such as the Support Vector Machine (SVM) and Logistic Regression (LR). Our intent is to develop rich, context-specific language representations by using the clauses of a TM to capture the patterns and sub-patterns of the data, which are then utilized for classification and domain adaptation tasks.

3 Data Representation Framework

3.1 Tsetlin Machine

A TM consists of dedicated teams of two-action Tsetlin Automata (TA) [41] that operate on Boolean inputs and their negations. Each TA has 2N states (i.e., N states per action) and performs an action depending on its current state: an “Include” action (in states 1 to N) or an “Exclude” action (in states \(N+1\) to 2N). The TA states are updated based on reinforcement feedback in the form of rewards and penalties. A reward strengthens the current action, enhancing the TA's confidence in it, whereas a penalty suppresses the action. The feedback helps the TA reach the optimal action, i.e., the one that maximizes the expected number of rewards. The learning capability of the TM comes from multiple teams of TAs that build conjunctive clauses in propositional logic. During learning, each TM clause captures a specific sub-pattern, comprising negated and non-negated inputs. The output is decided by counting the number of matching sub-patterns recognized by the clauses.

The TM accepts a vector \(\textbf{x}=[x_1,\ldots ,x_o]\) of o propositional features as Boolean input, to be categorized into one of Cl classes, \(Y= (y_1, y_2, \ldots , y_{Cl})\), where Cl is the total number of classes. These features are then turned into a set of literals that comprises the features themselves as well as their negated counterparts: \(L = \{x_1,\ldots ,x_o,\lnot {x}_1,\ldots ,\lnot {x}_o\}\).
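As a concrete illustration (a minimal NumPy sketch, not part of the original formulation), the literal set can be formed by concatenating the Boolean feature vector with its elementwise negation:

```python
import numpy as np

def to_literals(x: np.ndarray) -> np.ndarray:
    """Expand a Boolean feature vector x into the literal set
    L = [x_1, ..., x_o, not x_1, ..., not x_o]."""
    x = x.astype(np.uint8)
    return np.concatenate([x, 1 - x])

# Example with o = 3 propositional features.
print(to_literals(np.array([1, 0, 1])))  # [1 0 1 0 1 0]
```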

Fig. 1. Knowledge representation framework.

If there are Cl classes and m sub-patterns per class, a TM employs \({Cl} \times m\) conjunctive clauses to express the sub-patterns. For a given class, we index its clauses by j, \(1 \le j \le m/2\), each clause being a conjunction of literals. In general, half of the clauses are assigned positive polarity, i.e., \(C_j^+\), and the other half are assigned negative polarity, i.e., \(C_j^-\). A clause \(C_j^\psi \), \(\psi \in \{-,+\},\) is produced by ANDing a subset of the literals, \(L_j^\psi \subseteq L\):

$$\begin{aligned} \textstyle C_j^\psi (\textbf{x})=\bigwedge _{l_k \in L_j^\psi } l_k. \end{aligned}$$
(1)

Here, \(l_k, 1 \le k \le 2o,\) is a feature or its negation. \(L_j^\psi \) is a subset of the literal set \(L\). For example, a particular clause \(C_j^+(\textbf{x}) = x_1 \wedge \lnot x_2\) consists of literals \(L_j^+ = \{x_1, \lnot x_2\}\) and it outputs 1 if \(x_1 = 1\) and \(x_2 = 0\).
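As a minimal sketch of this evaluation (the literal layout and helper are illustrative, not taken from the paper), a clause can be computed as a logical AND over its included literals:

```python
import numpy as np

def evaluate_clause(literals: np.ndarray, included: list) -> int:
    """A clause outputs 1 only if every included literal is 1 (logical AND)."""
    return int(np.all(literals[included]))

# C_j^+(x) = x_1 AND NOT x_2, with o = 2 features.
# Literal layout: [x_1, x_2, not x_1, not x_2].
x = np.array([1, 0], dtype=np.uint8)
literals = np.concatenate([x, 1 - x])              # [1, 0, 0, 1]
print(evaluate_clause(literals, included=[0, 3]))  # 1, since x_1 = 1 and x_2 = 0
```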

The number of clauses m assigned to each class is user-configurable. The clause outputs are merged into a classification decision by summation and thresholding using the unit step function \(u(v) = 1 ~\textbf{if}~ v \ge 0 ~\textbf{else}~ 0\):

$$\begin{aligned} \textstyle \hat{y} = u\left( \sum _{j=1}^{m/2} C_j^+(\textbf{x}) - \sum _{j=1}^{m/2} C_j^-(\textbf{x})\right) . \end{aligned}$$
(2)

From Eq. (2), we can see that the classification is accomplished based on a majority vote, with the positive clauses voting for the class and the negative ones voting against it. For example, the classifier \(\hat{y} = u\left( x_1 \bar{x}_2 + \bar{x}_1 x_2 - x_1 x_2 - \bar{x}_1 \bar{x}_2\right) \) captures the XOR-relation. The TM learning involves guiding the TAs to take optimal actions. Each clause receives feedback for each round of training, which is transmitted to its individual TAs. The TM utilizes Type I and Type II feedback that governs the rewards and penalties distributed to the TAs. In short, Type I feedback is designed to develop frequent patterns, eliminate false negatives, and make clauses evaluate to 1. Type II feedback, on the other hand, enhances pattern discrimination, suppresses false positives, and makes clauses evaluate to 0. Both types of feedback allow clauses to learn numerous sub-patterns from data. The details of the learning process can be found in [22].
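The XOR example can be written out directly; the sketch below hard-codes the four clauses from the example and applies the majority vote with the unit step of Eq. (2):

```python
def classify_xor(x1: int, x2: int) -> int:
    """Majority vote of Eq. (2): positive clauses vote for the class,
    negative clauses vote against it."""
    positive = [x1 & (1 - x2), (1 - x1) & x2]     # x1*not(x2), not(x1)*x2
    negative = [x1 & x2, (1 - x1) & (1 - x2)]     # x1*x2,      not(x1)*not(x2)
    v = sum(positive) - sum(negative)
    return 1 if v >= 0 else 0                     # unit step u(v)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", classify_xor(a, b))         # 0, 1, 1, 0
```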

3.2 Data Representation

The trained TM comprises clauses that express sub-patterns in the data. In our NLP tasks, sub-patterns typically contain contextual combinations of words that explicitly characterize a specific class. In essence, the operation of the TM in NLP consists of building rules by ANDing groups of word literals that occur together in similar contexts. As such, the clauses (rules) are contextually rich and specific, which we here exploit to build accurate representations. Being modular and decomposable, our representation also indicates which clauses are relevant for a given input. Since our representations are based on logical rules, we refer to them as knowledge representations (cf. knowledge-based systems).

Our overall procedure is illustrated in Fig. 1 and can be detailed as follows. In brief, consider a trained TM with m clauses. Given the input text \(\textbf{x}=[x_1,\ldots ,x_o]\), we transform it into a representation consisting of logical rules:

$$\begin{aligned} TM _{ trans }^\textbf{x}= \zeta ^\textbf{x}= \mathbin \Vert _{y=1}^{Cl}[C_1^{y,+}(\textbf{x}), \ldots , C_{m/2}^{y,+}(\textbf{x}), C_1^{y,-}(\textbf{x}),\ldots , C_{m/2}^{y,-}(\textbf{x})]. \end{aligned}$$
(3)

Here, \(\mathbin \Vert _{y=1}^{Cl}\) denotes the array concatenation of the positive- and negative-polarity clause outputs for classes 1 to Cl. Each clause output can be computed using Eq. (1).
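A minimal sketch of this transformation is shown below; the toy clause definitions and the literal layout are placeholders chosen purely for illustration:

```python
import numpy as np

def tm_transform(literals: np.ndarray, clauses: list) -> np.ndarray:
    """Concatenate clause outputs over all classes (Eq. 3). `clauses[y]` lists,
    for class y, the included-literal indices of its m clauses
    (positive-polarity clauses first, then negative-polarity ones)."""
    outputs = [int(np.all(literals[included]))
               for class_clauses in clauses
               for included in class_clauses]
    return np.array(outputs, dtype=np.uint8)

# Toy example: Cl = 2 classes, m = 2 clauses per class, o = 2 features.
x = np.array([1, 0], dtype=np.uint8)
literals = np.concatenate([x, 1 - x])      # [x1, x2, not x1, not x2]
clauses = [[[0, 3], [1]],                  # class 1: x1 AND not x2;  x2
           [[2], [0, 1]]]                  # class 2: not x1;         x1 AND x2
print(tm_transform(literals, clauses))     # [1 0 0 0]
```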

Next, we perform feature compression, since the transformed feature array produced by Eq. (3) can be too sparse for many machine learning algorithms. Assume that the total number of input examples is E and that each example is converted to a vector \(\zeta ^\textbf{x} = (x_1, x_2, \ldots , x_{d_\zeta })\), \(\zeta ^\textbf{x} \in R^{d_\zeta }\), of dimensionality \(d_\zeta = Cl \cdot m\). The size of the input matrix then becomes \(E \times d_\zeta \). We transform this matrix further by centering the data, obtaining \(A=[\varOmega _1, \varOmega _2, \ldots , \varOmega _{E}]\), where the center and the centered examples are computed as follows:

$$\begin{aligned} \textbf{x}_r= \frac{\sum _{e=1}^{E} \textbf{x}_e}{E}, \end{aligned}$$
(4)
$$\begin{aligned} \varOmega _e = \textbf{x}_e- \textbf{x}_r. \end{aligned}$$
(5)

The covariance matrix of A is \(Cov(A,A) = E[(A - E[A])(A-E[A])^T]\), and its eigenvalues, arranged in decreasing order, are \(\gamma _1> \gamma _2> \ldots > \gamma _E\), with corresponding eigenvectors \(v_1, v_2, \ldots , v_E\). The set of original vectors can then be represented in the eigenspace as follows:

$$\begin{aligned} \varOmega = \alpha _1v_1 + \alpha _2v_2+ \ldots + \alpha _E v_E = \sum _{i=1}^E \alpha _iv_i. \end{aligned}$$
(6)

After picking the top \(\mathcal {P}\) eigenvectors \(v_i\) and corresponding eigenvalues \(\gamma _i\), we have:

$$\begin{aligned} \varOmega = \alpha _1v_1 + \alpha _2v_2+ \ldots + \alpha _\mathcal {P}v_\mathcal {P}= \sum _{i=1}^\mathcal {P} \alpha _iv_i, \end{aligned}$$
(7)

where \(\mathcal {P}\ll E\). In the above equation, the vector of coefficients \([\alpha _1, \alpha _2, \ldots , \alpha _\mathcal {P}]\) is the final representation, formed in a Principal Component Analysis (PCA) space. The number of dimensions is thus reduced, while the most important features are retained by the eigenvectors corresponding to the largest eigenvalues. The number of retained eigenvalues, \(\mathcal {P}\) out of E, is selected such that

$$\begin{aligned} \frac{\sum _{i=1}^{\mathcal {P}}\gamma _i}{\sum _{i=1}^{E}\gamma _i} \geqslant \theta . \end{aligned}$$
(8)

Here, \(\theta \) is typically set to 0.9 or 0.95. Each input example \(\varOmega _i\) is then projected onto the \(\mathcal {P}\) eigenvectors, giving coefficients \(\alpha _j = v_j^T \varOmega _i\), \(j=1,2,\ldots ,\mathcal {P}\). The resulting coefficient vector is the compressed representation of the input, which is given to the attached classifier, such as an SVM or LR.
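The compression step above can be sketched with scikit-learn, whose PCA implementation selects the smallest number of components whose cumulative explained-variance ratio reaches \(\theta \) when n_components is given as a fraction; the clause-output matrix below is a random placeholder rather than real TM output:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder clause-output matrix: E examples, d_zeta = Cl * m Boolean features.
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(1000, 400)).astype(np.float32)

theta = 0.95                                   # cumulative variance threshold (Eq. 8)
pca = PCA(n_components=theta, svd_solver="full")
Z_compressed = pca.fit_transform(Z)            # PCA centers the data internally

print(Z_compressed.shape)                      # (E, P): coefficient vectors [alpha_1 .. alpha_P]
print(pca.explained_variance_ratio_.sum())     # >= theta
```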

4 Experiments and Results

In this section, we evaluate the performance of the logical data representations created by transforming the input using the trained TMs.

4.1 Datasets

We conduct our experiments on six publicly accessible benchmark text classification datasets.

  • TREC-6 [13] is an open-domain, fact-based question classification dataset.

  • WebKB [16] comprises manually classified web pages gathered from the computer science departments of 4 universities and classified into 5 categories.

  • MPQA [43] is a dataset for detecting opinion polarity.

  • CR [20] is a customer review dataset with each sample labeled as positive or negative.

  • SUBJ [30] is a dataset for classifying reviews as subjective or objective.

  • R8 [18] is a subset of the Reuters-21578 dataset with 8 classes.

Table 1. Performance comparison (accuracy) of the vanilla TM with and without knowledge representation under three stopping settings.

4.2 Implementation Details

We utilize the publicly accessible predefined train and test splits for all the datasets. The TM model is first initialized with three parameters: the number of clauses m, threshold T, and specificity s. We generate the TM representation under 3 settings: early stopping, mid stopping, and best stopping, corresponding to training for 3, 10, and 250 epochs, respectively. This enables us to observe the effect of quick TM convergence on the representation and classification performance. The TM model is then trained under each setting, and the training and test inputs are transformed into their respective representations using the trained model. The representations obtained at this point are uncompressed and in sparse Boolean form. Thereafter, the representations are compressed using PCA and Linear Discriminant Analysis (LDA). Finally, the compressed representation is fed into a simple machine learning classifier such as a linear SVM or LR, where the only features are the transformed vectors from the TM. We repeat this procedure for each setting, and the results are reported in Subsect. 4.4.
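The pipeline can be sketched as follows. The snippet assumes the pyTsetlinMachine package and that its MultiClassTsetlinMachine exposes a transform method returning the Boolean clause outputs of Eq. (3); the hyperparameters and the random Booleanized data are placeholders, not the settings used in our experiments.

```python
import numpy as np
from pyTsetlinMachine.tm import MultiClassTsetlinMachine  # assumed TM implementation
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

# Placeholder Booleanized set-of-words data.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(500, 100), dtype=np.uint32)
y_train = rng.integers(0, 2, size=500, dtype=np.uint32)
X_test = rng.integers(0, 2, size=(100, 100), dtype=np.uint32)
y_test = rng.integers(0, 2, size=100, dtype=np.uint32)

# 1) Supervised pretraining (placeholder m, T, s; epochs=3 is the early-stopping setting).
tm = MultiClassTsetlinMachine(number_of_clauses=200, T=80, s=5.0)
tm.fit(X_train, y_train, epochs=3)

# 2) Transform inputs into sparse Boolean clause-output representations (assumed API).
Z_train = tm.transform(X_train).astype(np.float32)
Z_test = tm.transform(X_test).astype(np.float32)

# 3) Compress the representation with PCA (cumulative-variance threshold 0.95).
pca = PCA(n_components=0.95, svd_solver="full")
Z_train_c = pca.fit_transform(Z_train)
Z_test_c = pca.transform(Z_test)

# 4) Train a simple downstream classifier on the compressed representation.
clf = LinearSVC().fit(Z_train_c, y_train)
print("accuracy:", clf.score(Z_test_c, y_test))
```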

4.3 Baselines

We compare the performance of our framework with deep learning models and general pre-trained models. We adopt BERT [14] as the general pre-trained baseline; such models achieve state-of-the-art performance on a variety of NLP tasks. For deep learning baselines, we also include long short-term memory (LSTM) and convolutional neural network (CNN) models. We present the results of all the baseline models from their original papers. The LSTM model in our work is from [27], which represents the entire text based on the last hidden state. We use a BiLSTM model from [15, 39, 48]. The CNN models are taken from [15, 25] and employ pretrained word embeddings for initialization. The result for DiSAN, which adopts directional self-attention, is taken from [38]. The BERT model is from [14].

Table 2. Performance comparison of our model with baseline algorithms. We reproduce the results with the same hyperparameter configurations for all baselines for a fair comparison and report average accuracy across 10 different random seeds.

4.4 Results and Analysis

The performance of our representation is compared to the vanilla TM under the 3 settings explained in Subsect. 4.2; the results are shown in Table 1. Here, \(TM_{vanilla}\) denotes text classification using the standard TM, while \(TM_{rep}\) uses the representation generated by the TM under the various settings. On all datasets, we observe that classification using the representation outperforms the vanilla TM. For MPQA, we see a substantial improvement of around \(13\%\), followed by approximately \(4\%\) for TREC. With only 3 epochs, the TM representation already performs well, achieving the highest accuracy on the SUBJ dataset. This also demonstrates how the quick convergence of the TM enables the generation of richer representations within a small number of epochs, reducing the time needed to produce them.

Table 3. Computation cost of TM vs. BERT on the TREC dataset.

The performance of our representation framework is compared with the other baselines (from Subsect. 4.3) in Table 2. We observe that our framework performs competitively with the baselines. The model beats all other baselines on WebKB and R8, with accuracies of \(93.05\%\) and \(96.84\%\), respectively. WebKB largely entails classifying personal attributes captured by sparse data, such as categorizing individual students and professors from academia. The sparseness of the data may explain the poor performance of BERT, in line with a study reporting that BERT completely ignores the minority class at test time for low-resource tasks such as few-shot learning and rare entity recognition [40]. For TREC, MPQA, and CR, BERT outperforms all other models. Our model, on the other hand, achieves \(95.6\%\) on TREC, making it the second-best model. LSTM performs the worst of all models, whereas BiLSTM performs competitively. Surprisingly, a basic CNN model with static vectors gives competitive results against the more sophisticated attention-based DiSAN model. We see that the performance of the vanilla TM falls short compared with models initialized with pre-trained word embeddings. Overall, the TM representation performs competitively compared with computationally intensive models such as BERT, which is trained on a big text corpus. Training time and memory consumption of the TM and BERT are shown in Table 3. To calculate training time, we run both the TM and BERT for 10 epochs. Note that the TM is trained entirely from scratch, whereas BERT is only fine-tuned. We observe that the TM requires roughly \(3 \times \) less training time and memory than BERT.

4.5 Visualization

To investigate how our representation enhances performance on NLP tasks, we plot the learned representations using t-SNE. We visualize the input with and without the TM representation in Fig. 3. Additionally, the figure contains the corresponding BERT hidden-layer representation. We observe that both the TM and BERT representations provide richer and more precise information than the raw input, as the clusters become more separated and clear-cut. Further, notice how the TM clauses compress the data, significantly reducing the number of distinct data points (Fig. 3b). Each data point relates important features formulated in propositional logic. Additionally, we demonstrate in Fig. 2 how the number of epochs influences the representation: as the number of epochs increases, the clusters become increasingly compact and distinct.
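A sketch of how such a projection can be produced (with a random placeholder representation matrix standing in for the actual TM or BERT features) is given below:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder representation matrix and class labels (e.g., compressed TM clause
# outputs or BERT hidden states for the TREC test set).
rng = np.random.default_rng(0)
Z = rng.random((600, 128))
labels = rng.integers(0, 6, size=600)

# Project to 2-D and colour the points by class label.
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(Z)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE projection of the representation")
plt.show()
```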

Fig. 2. Visualization of the t-SNE projection of the representation produced in the 3 training settings on the TREC dataset.

Fig. 3. Visualization of the t-SNE projection of the raw data, TM, and BERT representations on the TREC dataset.

4.6 Concluding Remarks

From the above empirical results and visual analysis, we conclude that the TM representation considerably enhances the input feature space, resulting in improved performance. As explored further in the case study on interpretability below, the advantage of using the TM can at least partially be explained by its ability to capture both semantic and structural word representations from the input. Additionally, unlike DNNs, our model provides a reasonable trade-off between performance and explainability. That is, the TM representation is computationally simple and explainable through the logic-based propositional rules composed by the clauses.

5 A Case Study: Interpretability

In this section, we present a case study showing how the representation produced by our framework is interpretable. Consider the following two input sentences from the TREC dataset: “what is the highest waterfall in the united states?” with the label “Location” and “what is the date of boxing day?” with the label “Entity”. After tokenization, the vocabulary consists of the following tokens: [“what”, “highest”, “waterfall”, “united”, “states”, “date”, “boxing”, “day”]. During training, the clauses capture the distinctive patterns that designate each label. Figure 4 contains some sample clauses that support “Location”.

Fig. 4. Literals captured by clauses.

Referring to Fig. 1, the given input can be represented using the TM clauses of a trained model. The representation can be written as an array: \([C_1, C_2, \ldots , C_m] \rightarrow [1, 0, \ldots , 1]\). For a given input, the representation thus indicates which clauses are activated, and the activated clauses encapsulate the propositional rules necessary to make the correct classification decision. As a result, the representation is dense with information and fully interpretable. For example, for the above input, \(C_1\) and \(C_m\) in Fig. 4 are activated in the representation. The vocabulary encompassed by these clauses is [“highest”, “waterfall”, “united”, “states”]. That is, these clauses encapsulate the propositional rules associated with the label “Location”.
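A small sketch of this clause-to-vocabulary mapping is given below; the clause definitions are hypothetical stand-ins for the clauses in Fig. 4:

```python
# Clauses supporting "Location", given as the (word, negated?) literals they include.
clauses = {
    "C_1": [("highest", False), ("waterfall", False)],
    "C_m": [("united", False), ("states", False), ("date", True)],
}
# Clause outputs for the input "what is the highest waterfall in the united states?"
representation = {"C_1": 1, "C_2": 0, "C_m": 1}

# Activated clauses and the (non-negated) vocabulary they encompass.
activated = [c for c, out in representation.items() if out == 1 and c in clauses]
words = sorted({w for c in activated for (w, negated) in clauses[c] if not negated})
print(activated, words)
# ['C_1', 'C_m'] ['highest', 'states', 'united', 'waterfall']
```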

Table 4. Domain adaptation performance (accuracy %) on Amazon review dataset.

6 Application: Domain Adaptation

We here demonstrate that the input representations produced by our framework can be used in domain adaptation tasks. These results thus reinforce our previous conclusion that the representations are rich and also applicable as contexts in cross-domain applications. We employ the Amazon review dataset [11], which comprises 4 domains: Books (B), DVD (D), Electronics (E), and Kitchen & Housewares (K), giving 12 adaptation scenarios. Each domain has around 2000 labeled and approximately 4000 unlabeled reviews. We follow the transductive setting in [32], training on the source domain and testing on the target domains. For a fair comparison, the results for the baseline algorithms are obtained directly from [37, 51]. The results are summarized in Table 4. As shown, our approach outperforms the baseline algorithms in 9 out of 12 tasks, and on average, our model beats all the other algorithms.
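A compressed sketch of the transductive setup (with random placeholder clause-output representations standing in for a TM trained on the source domain) could look as follows:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Placeholder clause-output representations for a source/target pair (e.g., B -> D).
rng = np.random.default_rng(0)
Z_source = rng.integers(0, 2, size=(2000, 400)).astype(np.float32)
y_source = rng.integers(0, 2, size=2000)
Z_target = rng.integers(0, 2, size=(2000, 400)).astype(np.float32)
y_target = rng.integers(0, 2, size=2000)

# Fit the compression and the classifier on the source domain only,
# then evaluate directly on the target domain.
pca = PCA(n_components=0.95, svd_solver="full").fit(Z_source)
clf = LogisticRegression(max_iter=1000).fit(pca.transform(Z_source), y_source)
print("target accuracy:", clf.score(pca.transform(Z_target), y_target))
```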

7 Conclusion

In this paper, we propose a data representation framework that enhances the performance of Tsetlin Machines (TMs). Our approach produces richer data representations by utilizing the semantic and contextual patterns captured by the clauses of a TM. We conduct extensive experiments on NLP classification and domain adaptation using publicly available datasets. In NLP classification, our experimental findings suggest that our method is competitive with complicated and non-transparent DNNs, including BERT. In domain adaptation, we outperform all other baselines on average, illustrating that the representations produced by our framework can be employed in cross-domain applications. Additionally, using t-SNE plots, we visualize how the representation enhances the input features, yielding distinctive decision boundaries for each class. Finally, we present a case study demonstrating the interpretability of TM-generated representations.