
1 Introduction

The recently introduced Regression Tsetlin Machine (RTM) [1, 2] is a propositional logic based approach to interpretable non-linear regression, founded on the Tsetlin Machine (TM) [3]. Being based on disjunctive normal form, like Karnaugh maps, the TM can map an exponential number of input feature value combinations to an appropriate output [4]. Recent research reports several distinct TM properties. The clauses that a TM produces have an interpretable form (e.g., if X satisfies condition A and not condition B then Y = 1), similar to the branches in a decision tree [5]. For small-scale pattern recognition problems, up to three orders of magnitude lower energy consumption and inference time have been reported, compared to comparable neural networks [6]. Like neural networks, the TM can be used in convolution, providing competitive memory usage, computation speed, and accuracy on MNIST, F-MNIST and K-MNIST, in comparison with simple 4-layer CNNs, K-Nearest Neighbors, SVMs, Random Forests, Gradient Boosting, BinaryConnect, Logistic Circuits, and ResNet [7]. By introducing clause weights that allow one clause to represent multiple clauses, it has been demonstrated that the number of clauses can be reduced by up to 50\(\times \) without loss of accuracy, leading to more compact clause sets [4]. Finally, hyper-parameter search can be simplified with multi-granular clauses, eliminating the pattern specificity parameter [8].

Paper Contributions: In the RTM, regression resolution is proportional to the number of conjunctive clauses employed. In other words, computation cost and memory usage grow proportionally with resolution. Building upon the Weighted TM (WTM) by Phoulady et al. [4], this paper introduces weights to the RTM scheme. However, while the WTM uses real-valued weights for classification, we instead propose a novel scheme based on integer weights, targeting regression. In brief, we use the stochastic searching on the line (SPL) approach pioneered by Oommen in 1997 [9] to eliminate multiplication from the weight updating. In addition to the computational benefits this entails, we also argue that integer weighted clauses are more interpretable than real-valued ones because they can be seen as multiple copies of the same clause. Finally, our scheme does not introduce any additional hyper-parameters, whereas the WTM requires a weight learning speed hyper-parameter.

Paper Organization: The remainder of the paper is organized as follows. In Sect. 2, the basics of RTMs are provided. Then, in Sect. 3, the SPL problem and its solution are explained. The main contribution of this paper, the integer weighting scheme for the RTM, is presented in detail in Sect. 4 and evaluated empirically using two artificial datasets in Sect. 5. We conclude our work in Sect. 6.

2 The Regression Tsetlin Machine (RTM)

The RTM performs regression based on formulas in propositional logic. In brief, the input to an RTM is a vector \(\mathbf {X}\) of o propositional variables \(x_k\), \(\mathbf {X} \in \{0,1\}^{o}\). These are further augmented with their negated counterparts \(\bar{x}_k = 1 - x_k\) to form a vector of literals: \(\mathbf {L} = [x_1, \ldots , x_o, \bar{x}_1, \ldots , \bar{x}_o] = [l_1, \ldots , l_{2o}]\). In contrast to a regular TM, the output of an RTM is real-valued, normalized to the domain \(y \in [0, 1]\).

Regression Function: The regression function of an RTM is simply a linear summation of products, where the products are built from the literals:

$$\begin{aligned} y = \frac{1}{T} \sum _{j=1}^m \prod _{k \in I_j} l_k. \end{aligned}$$
(1)

Above, the index j refers to one particular product of literals, defined by the subset \(I_j\) of literal indexes. If we, for instance, have two propositional variables \(x_1\) and \(x_2\), the literal index sets \(I_1 = \{1, 4\}\) and \(I_2 = \{2, 3\}\) define the function \(y = \frac{1}{T}(x_1 \bar{x}_2 + \bar{x}_1 x_2)\). The user-set parameter T decides the resolution of the regression function. Notice that each product in the summation evaluates to either 0 or 1. This means that a larger T requires more literal products to reach a particular value y. Thus, increasing T makes the regression function increasingly fine-grained. In the following, we formulate and refer to the products as conjunctive clauses, as is typical for the regular TM. The value \(c_j\) of each product is then a conjunction of literals:

$$\begin{aligned} c_j = \prod _{k \in I_j} l_k = \bigwedge _{k \in I_j} l_k. \end{aligned}$$
(2)

Finally, note that the number of conjunctive clauses m in the regression function is also a user-set parameter, which decides the expressive power of the RTM.
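To make the computation concrete, the following Python sketch evaluates Eqs. (1) and (2) for the two-variable example above (the function names and the 0-based literal indexing are ours, not the paper's):

```python
import numpy as np

def literals(X):
    """Build the literal vector L = [x_1, ..., x_o, not x_1, ..., not x_o]."""
    return np.concatenate([X, 1 - X])

def clause_output(L, I_j):
    """Eq. (2): the conjunction (product) of the literals indexed by I_j."""
    return int(all(L[k] == 1 for k in I_j))

def rtm_output(X, include_sets, T):
    """Eq. (1): the sum of clause outputs, normalized by the resolution T."""
    L = literals(X)
    return sum(clause_output(L, I_j) for I_j in include_sets) / T

# The example from the text, with 0-based indexes [x1, x2, ~x1, ~x2]:
# I_1 = {x1, ~x2} -> {0, 3} and I_2 = {~x1, x2} -> {1, 2}.
print(rtm_output(np.array([1, 0]), [{0, 3}, {1, 2}], T=2))  # -> 0.5
```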

Tsetlin Automata Teams: The composition of each clause is decided by a team of Tsetlin Automata (TAs) [10]. There are \(2o\) TAs per clause j. Each represents a particular literal k and decides whether to include or exclude that literal in the clause. The decision depends on the current state of the TA, denoted \(a_{j,k} \in \{1, \ldots , 2N\}\). States from 1 to N produce an exclude action and states from \(N+1\) to 2N produce an include action. Accordingly, the set of indexes \(I_j\) can be defined as \(I_j = \left\{ k | a_{j,k} > N, 1 \le k \le 2o\right\} \). The states of all of the TAs are organized as an \(m \times 2o\) matrix \(\mathbf {A}\): \(\mathbf {A} = (a_{j,k}) \in \{1, \ldots , 2N\}^{m \times 2o}\), where m is the number of clauses.
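Reading the include sets off the state matrix is then straightforward; below is a minimal sketch in which random states stand in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
m, o, N = 4, 3, 100                              # clauses, variables, states per action
A = rng.integers(1, 2 * N + 1, size=(m, 2 * o))  # TA states in {1, ..., 2N}

# A TA whose state exceeds N plays "include", so I_j collects those indexes.
include_sets = [set(np.flatnonzero(A[j] > N)) for j in range(m)]
```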

Learning Procedure: Learning in RTM is done through an online reinforcement scheme that updates the state matrix \(\mathbf {A}\) by processing one training example \((\hat{X}_i, \hat{y}_i)\) at a time, as detailed below.

The RTM employs two kinds of feedback, Type I and Type II, further defined below. Type I feedback triggers TA state changes that eventually make a clause output 1 for the given training example \(\hat{X}_i\). Conversely, Type II feedback triggers state changes that eventually make the clause output 0. Thus, overall, regression error can be systematically reduced by carefully distributing Type I and Type II feedback:

$$\begin{aligned} Feedback = {\left\{ \begin{array}{ll} \text {Type I}, \;\;\;\;\;\;\;\; &{} \text {if }\;\; y < \hat{y}_i ,\\ \text {Type II}, \;\;\;\;\;\; \;\; &{} \text {if }\;\; y > \hat{y}_i . \end{array}\right. } \end{aligned}$$
(3)

In effect, the number of clauses that evaluate to 1 is increased when the predicted output is less than the target output (\(y < \hat{y}_i\)) by providing Type I feedback. Type II feedback, on the other hand, is applied to decrease the number of clauses that evaluate to 1 when the predicted output is higher than the target output (\(y > \hat{y}_i\)).

Activation Probability: Feedback is handed out stochastically to regulate learning. The feedback probability \(p_{j}\) is proportional to the absolute error of the prediction, \(|y - \hat{y}_i|\). Clauses activated for feedback are then stored in the matrix \(\mathbf {P} = (p_j) \in \{0,1\}^m\).
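A sketch of the feedback decision (Eq. 3) and clause activation follows. Since both y and \(\hat{y}_i\) lie in [0, 1], we take the activation probability to be the absolute error itself; the text states only proportionality, so this unit scaling is an assumption:

```python
import numpy as np

rng = np.random.default_rng(42)

def select_feedback(y, y_hat, m):
    """Eq. (3) plus stochastic clause activation; returns the type and P."""
    if y == y_hat:
        return None, np.zeros(m, dtype=int)       # zero error: no feedback
    feedback_type = "I" if y < y_hat else "II"
    P = (rng.random(m) < abs(y - y_hat)).astype(int)
    return feedback_type, P

print(select_feedback(0.4, 0.7, m=5))             # e.g. ('I', array([1, 0, 1, 0, 0]))
```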

Type I Feedback: Type I feedback subdivides into Type Ia and Type Ib. Type Ia reinforces include actions of TAs whose corresponding literal value is 1, however, only when the clause output is 1. The probability that the \(k^{th}\) TA of the \(j^{th}\) clause receives Type Ia feedback, \(r_{j,k}\), is \(\frac{s-1}{s}\), where \(s \ge 1\) is a user-set parameter. Type Ib combats over-fitting by reinforcing exclude actions of TAs when the corresponding literal is 0 or when the clause output is 0. The probability that the \(k^{th}\) TA of the \(j^{th}\) clause receives Type Ib feedback, \(q_{j,k}\), is \(\frac{1}{s}\).

Using the complete set of conditions, the TAs selected for Type Ia feedback are singled out by the indexes \({I}^{\text {Ia}} = \left\{ (j,k) | l_{k} = 1 \wedge c_j = 1 \wedge p_j = 1 \wedge r_{j,k} = 1 \right\} .\) Similarly, TAs selected for Type Ib are \({I}^{\text {Ib}} = \left\{ (j,k) | (l_{k} = 0 \vee c_j = 0) \wedge p_{j} = 1 \wedge q_{j,k} = 1 \right\} .\)

Once the TAs have been targeted for Type Ia and Type Ib feedback, their states are updated. The available updating operators are \(\oplus \) and \(\ominus \), where \(\oplus \) adds 1 to the current state while \(\ominus \) subtracts 1. Thus, before a new learning iteration starts, the states in the matrix \(\mathbf {A}\) are updated as follows: \(\mathbf {A} \leftarrow \left( \mathbf {A} \oplus {{I}^{\text {Ia}}}\right) \ominus {{I}^{\text {Ib}}}\).

Type II Feedback: Type II feedback eventually changes the output of a clause from 1 to 0, for a specific input \(\hat{X}_i\). This is achieved simply by including one or more of the literals that take the value 0 for \(\hat{X}_i\). The indexes of TAs selected for Type II can thus be singled out as \({I}^{\text {II}} = \left\{ (j,k) | l_{k} = 0 \wedge c_j = 1 \wedge p_{j} = 1 \right\} \). Accordingly, the states of the TAs are updated as follows: \(\mathbf {A} \leftarrow \mathbf {A}\oplus {I}^{\text {II}}\).
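The three index sets and the state transitions can be collected into a single update routine. The sketch below clamps states to \(\{1, \ldots, 2N\}\), a boundary-handling assumption the text leaves implicit:

```python
import numpy as np

def update_states(A, L, C, P, feedback_type, s, N, rng):
    """Apply I^Ia, I^Ib or I^II with the corresponding +1/-1 state operators.

    A: (m, 2o) TA states, L: (2o,) literal values, C: (m,) clause outputs,
    P: (m,) activated clauses. A is modified in place and returned.
    """
    m, two_o = A.shape
    for j in range(m):
        if P[j] == 0:
            continue                                     # clause not activated
        for k in range(two_o):
            if feedback_type == "I":
                if L[k] == 1 and C[j] == 1 and rng.random() < (s - 1) / s:
                    A[j, k] = min(A[j, k] + 1, 2 * N)    # Type Ia: toward include
                elif (L[k] == 0 or C[j] == 0) and rng.random() < 1 / s:
                    A[j, k] = max(A[j, k] - 1, 1)        # Type Ib: toward exclude
            elif feedback_type == "II" and L[k] == 0 and C[j] == 1:
                A[j, k] = min(A[j, k] + 1, 2 * N)        # include a 0-literal
    return A
```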

3 Stochastic Searching on the Line

Stochastic searching on the line, also referred to as stochastic point location (SPL), was pioneered by Oommen in 1997 [9]. SPL is a fundamental optimization problem where one tries to locate an unknown unique point within a given interval. The only available information for the Learning Mechanism (LM) is the possibly faulty feedback provided by the attached environment (E). According to this feedback, the LM moves right or left from its current location in a discretized solution space.

The task at hand is to determine the optimal value \(\lambda ^*\) of a variable \(\lambda \), assuming that the environment is informative. That is, it provides the correct direction of \(\lambda ^*\) with probability \(p > 0.5\). In SPL, \(\lambda \) is assumed to be any number in the interval [0, 1]. The SPL scheme of Oommen discretizes the solution space by subdividing the unit interval into N steps, \(\{0, 1/N, 2/N, \ldots , (N-1)/N, 1\}\). Hence, N defines the resolution of the learning scheme.

The current guess, \(\lambda (n)\), is updated according to the feedback from the environment as follows:

$$\begin{aligned} \lambda (n+1) = {\left\{ \begin{array}{ll} \lambda (n) + 1/N, \;\;\;\;\;\;\;\; &{} \text {if } \;\; E(n)=1 \text { and } 0\leqslant \lambda (n)<1 \; ,\\ \lambda (n) - 1/N, \;\;\;\;\;\; \;\; &{} \text {if }\;\; E(n)=0 \text { and } 0< \lambda (n)\leqslant 1 \; ,\\ \lambda (n), \;\;\;\;\;\; \;\; &{} \text {Otherwise }\;\;. \end{array}\right. } \end{aligned}$$
(4)

The feedback \(E(n) = 1\) is the environment's suggestion to increase the value of \(\lambda \), while \(E(n) = 0\) is its suggestion to decrease it. Asymptotically, the learning mechanism is able to find a value arbitrarily close to \(\lambda ^*\) when \(N \rightarrow \infty \) and \(n \rightarrow \infty \).
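Eq. (4) translates directly into code. The toy run below, with an informative environment (\(p = 0.8\)) and a target \(\lambda ^* = 0.6\) of our own choosing, illustrates the convergence behavior:

```python
import random

def spl_step(lam, E, N):
    """Eq. (4): one move on the discretized interval {0, 1/N, ..., 1}."""
    if E == 1 and lam < 1:
        return lam + 1 / N
    if E == 0 and lam > 0:
        return lam - 1 / N
    return lam

random.seed(0)
lam, N, p, target = 0.0, 10, 0.8, 0.6
for n in range(200):
    correct = 1 if lam < target else 0                   # direction toward lambda*
    E = correct if random.random() < p else 1 - correct  # possibly faulty feedback
    lam = spl_step(lam, E, N)
print(round(lam, 1))  # hovers near 0.6
```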

4 Regression Tsetlin Machine with Weighted Clauses

We now introduce clauses with integer weights to provide a more compact representation of the regression function. The regression function for the integer weighted RTM attaches a weight \(w_j\) to each clause output \(c_j\), \(j = 1, \ldots , m\). Consequently, the regression output can be computed according to Eq. 5:

$$\begin{aligned} y = \frac{1}{T} \sum _{j=1}^m w_j \prod _{k \in I_j} l_k. \end{aligned}$$
(5)

Weight Learning: Our approach to learning the weight of each clause is similar to SPL. However, the solution space of each weight is \([0, +\infty )\), while the resolution of the learning scheme is \(N = 1\). The weight attached to a clause is updated when the clause receives Type Ia feedback or Type II feedback. The weight updating procedure is summarized in Algorithm 1. Here, \(w_j(n)\) is the weight of clause j at the \(n^{th}\) training round.

Algorithm 1. Updating of clause weights.
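Since the algorithm body did not survive extraction, the following sketch reconstructs it from the surrounding description: Type Ia feedback plays the role of \(E(n) = 1\), Type II plays the role of \(E(n) = 0\), the step size is 1, and weights are floored at zero. The function names are ours:

```python
def update_weight(w_j, got_type_Ia, got_type_II):
    """One SPL-style weight step with resolution N = 1 on [0, +inf)."""
    if got_type_Ia:
        return w_j + 1                    # environment says "increase"
    if got_type_II and w_j > 0:
        return w_j - 1                    # environment says "decrease"
    return w_j

def weighted_rtm_output(weights, clause_outputs, T):
    """Eq. (5): the weighted, normalized sum of clause outputs."""
    return sum(w * c for w, c in zip(weights, clause_outputs)) / T
```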

Note that since weights in this study can take any value greater than or equal to 0, an unwanted clause can be turned off by setting its weight to 0. Further, sub-patterns that have a large impact on the calculation of y can be represented with a correspondingly larger weight.

5 Empirical Evaluation

In this section, we study the behavior of the RTM with integer weighting (RTM-IW) using two artificial datasets similar to those presented in [1], comparing it with the standard RTM and an RTM with real-valued weights (RTM-RW). We use Mean Absolute Error (MAE) to measure performance.

Artificial Datasets: Dataset I contains 3-bit feature input. The output, in turn, is 100 times larger than the decimal value of the binary input (e.g., the input [0, 1, 0] produces the output 200). The training set consists of 8000 samples while the testing set consists of 2000 samples, both without noise. Dataset II contains the same data as Dataset I, except that the output of the training data is perturbed to introduce noise. Each input feature has been generated independently with equal probability of taking either the value 0 or 1, producing a uniform distribution of bit values.
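A data generator along these lines is sketched below. The Gaussian form of the training noise is our assumption; the text states only that noise is introduced:

```python
import numpy as np

def make_dataset(n, noise_std=0.0, seed=0):
    """Dataset I style data: y = 100 * decimal value of three i.i.d. uniform bits."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, 3))              # each bit is 0/1 with prob. 0.5
    y = 100 * (4 * X[:, 0] + 2 * X[:, 1] + X[:, 2])  # e.g. [0, 1, 0] -> 200
    return X, y + rng.normal(0.0, noise_std, size=n)

X_train, y_train = make_dataset(8000)                # noise-free training set
X_test, y_test = make_dataset(2000, seed=1)          # noise-free test set
```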

Results and Discussion: The pattern distribution of the artificial data was analyzed in the original RTM study. As discussed, there are eight unique sub-patterns. The RTM is able to capture the complete set of sub-patterns utilizing no more than three types of clauses, i.e., (1 ∗ ∗), (∗ 1 ∗), and (∗ ∗ 1), where ∗ denotes an input bit the clause ignores. However, to produce the correct output, some clauses must be duplicated multiple times, depending on the input pattern. For instance, each dataset requires seven clauses to represent the three different patterns it contains, namely, 4 \(\times \) (1 ∗ ∗), 2 \(\times \) (∗ 1 ∗), and 1 \(\times \) (∗ ∗ 1). So, with e.g. the input [1, 0, 1], the four clauses which represent the pattern (1 ∗ ∗) and the one clause which represents the pattern (∗ ∗ 1) activate, which correctly yields the output 500 after scaling back from the normalized domain.

Table 1. Behavior comparison of different RTM schemes on Dataset I.

Notably, it turns out that the RTM-IW requires even fewer clauses to capture the sub-patterns in the above data, as outlined in Table 1. Instead of devoting multiple clauses to one sub-pattern, the RTM-IW utilizes merely one clause with the correct weight to do the same job. The advantage of the proposed integer weighting scheme is thus apparent: it learns the correct weight of each clause, achieving an MAE of zero. Further, it is possible to ignore redundant clauses simply by giving them the weight zero. For the present dataset, for instance, decreasing m while keeping the same resolution, \(T=7\), does not impede accuracy. The RTM-RW, on the other hand, struggles to find the correct weights and fails to minimize MAE. Here, the real-valued weights were updated with a learning rate of \(\alpha = 0.01\), determined using a binary hyper-parameter search.

Fig. 1. The training error variation per training epoch for different RTM schemes.

Table 2. Training and testing MAE after 200 training epochs by various methods with different m and T.

Figure 1 casts further light on learning behavior by reporting training error per epoch for the three different RTM schemes with \(m=70\) and \(T=100~000\). As seen, both RTM and RTM-IW obtain relatively low MAE after just one training epoch, eventually reaching an MAE of zero (the training MAE at the end of training is given in the legend of each graph). RTM-RW, on the other hand, starts off with a much higher MAE, which decreases drastically over the first few epochs; however, it fails to reach an MAE of zero, instead becoming asymptotically stable.

We also studied the effect of T on performance with noise free data by varying T while fixing the number of clauses m. For instance, the RTM was able to reach a training MAE of 1.9 and a testing MAE of 2.1 with \(m = T = 300\) [1]. For the same dataset, the RTM-IW reaches a training MAE of 0.19 and a testing MAE of 1.87 with \(m = 200\) and \(T = 2000\). Further, for \(m = 200\) and \(T = 20~000\), both training and testing MAE drop to 0.027. Finally, by increasing T to \(200~000\), training MAE falls to 0.0003, while testing MAE stabilizes at 0.0002.

To further compare the performance of the RTM-IW with the RTM and RTM-RW, each approach was evaluated using a wide range of m and T settings. Representative training and testing MAE for both datasets are summarized in Table 2. Here, the number of clauses used with each dataset is also given. The T for the original RTM is equal to the number of clauses, while for the RTMs with weights, T is simply 100 times that number.

As seen, both training and testing MAE reach zero when the RTM operates on noise free data with \(m = 7\) clauses. For the RTM-IW and RTM-RW, MAE approaches zero as the number of clauses m increases.

For noisy data, the minimum training MAE achieved by the RTM is 5.500, obtained with \(m=5000\) clauses. The RTM-IW, on the other hand, obtains a lower MAE of 5.373 with less than half as many clauses (\(m = 2000\)). The accuracy of the RTM-IW in comparison with the RTM-RW is less clear-cut, with quite similar MAE for noisy data. The average testing MAE across both datasets, however, reveals that the RTM-IW outperforms the RTM-RW (2.101 vs. 2.168).

6 Conclusion

In this paper, we presented a new weighting scheme for the Regression Tsetlin Machine (RTM), the RTM with Integer Weights (RTM-IW). The weights attached to the clauses help the RTM represent sub-patterns more compactly. Since the weights are integers, interpretability is improved through a more compact representation of the clause set. We also presented a new weight learning scheme based on stochastic searching on the line, integrated with the Type I and Type II feedback of the RTM. The RTM-IW obtains accuracy on par with or better than that of the RTM without weights, using fewer clauses. It also performs competitively in comparison with an alternative RTM with real-valued weights.