
1 Introduction

‘Artificial Intelligence’ now covers a wide range of tasks such as image recognition, speech recognition, game playing, and protein folding, each of which can be performed at, near, or beyond human level. Over time the term has drifted in meaning, away from a ‘thinking machine’ and toward systems that can often only be applied to a single task, do not improve without further training, and require large amounts of resources to train and run. For example, GPT-3, a large language model, is estimated to have cost over 4.6 million dollars to train [19]. These models can be opaque, difficult to interpret, unable to explain why a particular prediction was made, and unable to provide any guarantees in failure scenarios. Predicate logic, on the other hand, is capable of robust and consistent decisions. One such predicate logic system, CYC [18], aims to encode all common human knowledge in a knowledge graph. CYC therefore inherits the limitations of predicate logic, one being that all axioms (in the knowledge graph) must be true and mutually consistent, otherwise false statements can be derived. Another approach is non-axiomatic reasoning. The Non-Axiomatic Reasoning System (NARS) performs reasoning that does not assign an objective truth value to a statement, but instead assigns a subjective one. This subjective value is not fully trusted, and is revised over time as new information arrives. NARS has the advantages that it 1) can cope with holding conflicting information in its knowledge base, 2) can explain its predictions, 3) requires less data for inference, and 4) explicitly implements logic for choosing which concepts to remove rather than forgetting at random.

While Open NARS for Applications (ONA) was designed with reasoning in mind, we choose to investigate its usefulness, and its resource consumption, on perception, and in particular speech recognition. This is because, as Peirce stated, “abductive inference shades into perceptual judgement without any sharp line of demarcation between them” [23], and advances in understanding one may shed light on the other. The integration of deep learning and logic reasoning is an open research problem and is considered to be key to the development of truly intelligent agents [21]. We narrow the focus of this paper to the dimensionality reduction and logic needed to convert auditory sensory data into category labels, and to the resources this requires in the Open NARS for Applications (ONA) software platform. We first give background, starting from the recent discussion around the definition of intelligence, and then describe how the human nervous system fused from two independent systems, which perhaps explains some of its characteristics. We then give a brief literature review, followed by our method and experiments. In the last section we present and discuss our results.

2 Background

Until recently, and possibly still, industry (and perhaps academia too) has been interested mainly in whether new tasks can be learnt by AI and, if so, whether they can be sold profitably to consumers. Developers are under no obligation to consider the environmental impact or safety concerns; some do so by choice, but there is little compulsion from a social or regulatory point of view. This is partly because deep learning is “unreasonably good” [25] and partly because no other known way of achieving the same level of performance exists. A focus on resource consumption was created by adding it explicitly into the definition of artificial intelligence [30]. Invitations to comment on this definition produced much discussion [31]. In this discussion it was pointed out that industry already operates under finite resource limitations [17], which is true; however, for most leader-board tasks resource consumption is not taken into account. Governments also appear hesitant to impose resource limitations on industry. A resource limitation is of interest because it rules out brute-force approaches and opens up the possibility of investigating how fewer resources can be used over time. Brute-force approaches can also encode an entire domain space, further limiting the conclusions that can be drawn. If a method requires fewer resources over time, as a task or operation is repeated, then this suggests a deeper (or perhaps more precise, or over-fitted) understanding of that task or operation, which may be of interest when investigating intelligence. Wang’s definition of intelligence [30] separates skill (e.g. playing chess) from intelligence, and contains the assumption of insufficient knowledge and resources (AIKR). Under this definition, learning a new skill to the same level of ability as some other method, but with fewer resources or an insufficient understanding of the world (i.e. imperfect knowledge), is advantageous. Thus intelligence and skill are separate concepts.

The focus of this paper is on the cognitive processes that underlie speech recognition. We assume these are similar to the cognitive processes that underlie other forms of perception. However, as stated earlier, the line between perceptual judgement and abductive inference has no clear demarcation [23]. This may be because the physiology underpinning these functions has different origins. Genetic patterning studies suggest that the ‘blastoporal nervous system’, which coordinates feeding movements and locomotion, and the ‘apical nervous system’, which controls general body physiology, evolved separately in our ancestors more than six hundred million years ago [2, 28] and subsequently fused. This may help explain why we (humans) remain aware of differences between parts of our nervous system: we are aware of our cognition around feeding and locomotion, but have little to no awareness of our ‘apical nervous system’. There are arguments that perception and cognition are unified [3, 6, 10, 13, 16, 26, 27], and arguments against, that is, for modularity [9, 22]. The debate between modularity and unification is beyond the scope of this paper. The important point for our purposes is that the mechanisms of perception are not fully available to us. We do not know, for example, how we recognise objects or how we understand speech. Speech is temporal in nature and involves nuanced differentiation between acoustically similar sounds (for example ‘b’ in ‘bright’ and ‘f’ in ‘fright’).

Model performance over the last few decades has steadily improved; however, it is computationally expensive. Current state-of-the-art models rely on a low-level acoustic model followed by a language model. The acoustic model converts a sound wave into an encoded representation of a sound, and the language model gives the probabilities of the next word, given the last few words, along with the acoustic encoding. As noted, language models like GPT-3 require a large amount of data to train, which conflicts with how children learn a new word from very few examples [5]. Wang’s definition of intelligence [30] is based on the idea that intelligence is about making the most of the resources available, and that it is not always possible to know everything that is going on in the world. So someone who is able to learn new skills quickly and efficiently, even without a lot of initial knowledge about those skills, would be considered intelligent. One approach to understanding the mechanisms of intelligence is to consider the different ways in which it can be measured. One way may be to look at a person’s or system’s ability to learn new skills, and then to measure the quality of that skill as well as the energy consumed to learn it and to perform it once learnt. Speech recognition is one such potential skill.

Generally, a large dimensionality reduction is needed to convert perceived inputs into symbols on which logic can be applied. One approach [4] is to cluster the inputs in the feature space before similarity and difference are calculated and used as input to a deep network that is then trained. Another approach is to pre-process with a DL model specifically trained for that modality, e.g. YOLO for computer vision, and use the generated labels. The generated labels form a lower bound on the resolution of the logic system; i.e. if the labels are ‘dogs’ and ‘cats’, it would be difficult for a logic system to learn of a new breed of dog. If the logic system instead uses scalar output features of a DL model, e.g. the bounding box of the cat, x by y pixels, scaling issues can arise: if only far-away cats are seen, and then later a closer (and therefore much bigger) cat is seen, the system may not see the full range of sizes immediately, and must re-calibrate previous observations when the scale is readjusted on seeing a much larger, or smaller, example. In speech these challenges take the form of the dimensionality reduction from 16k samples per second to \(\sim \)2 words per second, with uneven speaking speed. Speech recognition has traditionally used labelled data sets which cost around USD 50/h to hand label, limiting the training data set size to the 10,000 h or less range. The resultant systems have low generalisability, with many recent state-of-the-art systems reporting <5% word error rates which collapse into the 30–40% range when used on other, similar datasets [24]. The exact cost of training models such as Whisper [24], with 1.6 billion parameters, is unknown, but the 175 billion parameter GPT-3 (109\(\times \) larger), also by OpenAI, is estimated at 4.6 million USD. If training cost were a constant multiple of the number of parameters (it is not), training Whisper could be in the order of magnitude of 40k USD. The training data set of Whisper was 168k hours of 16 kHz speech. This equates to a data set size of about 19 TB, approximately half the estimated 45 TB of GPT-3’s training data. The recent success of attention in other domains has also been applied to speech: Andrade et al. [1] developed a 202K parameter neural attention model, which we will refer to as ANAM, also targeted at the Speech Command dataset.
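As a rough check of the quoted data set size, assuming 16 bit (2 byte) mono samples at 16 kHz:

$$\begin{aligned} 168{,}000\,\mathrm{h} \times 3600\,\tfrac{\mathrm{s}}{\mathrm{h}} \times 16{,}000\,\tfrac{\mathrm{samples}}{\mathrm{s}} \times 2\,\tfrac{\mathrm{B}}{\mathrm{sample}} \approx 1.9 \times 10^{13}\,\mathrm{B} \approx 19\,\mathrm{TB}. \end{aligned}$$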

3 Literature Review

In the 1970s speech recognition relied on hand-crafted features. This changed as end-to-end differentiable systems were developed and new state-of-the-art results were reached [24]. These systems lack interpretability which, while not important for speech recognition itself, is of interest if the features triggering decisions can be exposed and validated. With concept whitening [7] it is possible to concentrate (grounded) meaning in single neurons to aid interpretability, but this requires category labels, which may not be available at training time, and adds complication, as category labels “need to address topics like the representation of concepts, the strength of membership in a category, mechanisms for forming new concepts and the relation between a concept and the outside world” [32]. Another approach is Deep Logic Models, which integrate deep learning and logic reasoning in an end-to-end differentiable architecture [21]. This work leads on to Relational Reasoning Networks (R2Ns) [20], which perform relational reasoning in the latent space of a deep learner architecture. However, these suffer an explosion in memory as the number of possible ground atoms grows polynomially in the arity of the considered relations. This underpins the usefulness of the AIKR. Shanahan [26] explores perception of objects via computer vision, with abduction, but does not describe how resource intensive the work was or whether resource limits were reached. Johansson [14] investigates learning match-to-sample and the relationships more, less and opposite, noting the advantages of AGI research in allowing us to experiment with an agent’s internals, and giving the example where NARS layer 6 (variables for generalisation) is shown to be needed for the work of Hammer [11] on a system designed to process only data from perception, i.e. one with no predefined knowledge. Generalised identity matching, where a new example is matched to a reference sample, has been shown to be possible in NARS [15], and further, the derived identity concept can generalise to novel situations. In unpublished work, Durisek [8] postulated that speech recognition leveraging phonemes may be possible with NARS.

4 Method

For simplicity we attempt to identify single whole spoken words, which has been the focus of much research. We use a standard data set, Speech Command v2 [33], which contains 35 single-word commands: the digits 0–9, back, forward, and other easily confused words (bird, bed). Each word has over 3000 recordings, each of 1 s in duration or less. As in Whisper [24], we take 16 bit, 16 kHz audio, on which an 80-bin MEL (logarithmic) spectrum was calculated every 10 ms. This produced 8000 (80\(\,\times \,\)100) energy intensity values per second, which were normalised. Utterances shorter than 1 s were padded. This reduced the input dimension from 16000 to 8000, and is a standard pre-processing step in speech recognition. These 8000 values needed to be restructured to be passed into ONA. Data was presented to ONA in the form of Narsese statements. As a simple example, we encode three examples as Narsese instances, A, B and C, with n properties each. The strength of the property relationship to the instance was encoded in the truth value, i.e. a property with a strength of 0.9 would be encoded as:

$$\begin{aligned} <\{A\} \rightarrow [p1]>. \%0.9\% \end{aligned}$$
(1)

meaning “‘{A}’ has the property ‘p1’ with a strength of 0.9”. It was then asserted that {A} is a LABEL and {B} is a LABEL, e.g.

$$\begin{aligned} <\{A\} \rightarrow LABEL >. \end{aligned}$$
(2)

The system was then queried to see whether C is also an instance of LABEL:

$$\begin{aligned} < \{C\} \rightarrow LABEL>? \end{aligned}$$
(3)
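To make the toy example concrete, the sketch below generates the corresponding Narsese text for ONA, using the ASCII form `-->` of the inheritance copula. The property values, the two-property setup, and the helper script are illustrative assumptions, not taken from our experiments.

```python
# Illustrative sketch: emit the Narsese of Eqs. (1)-(3) for a toy A/B/C example.
properties = {                      # instance -> {property: strength}
    "A": {"p1": 0.9, "p2": 0.8},
    "B": {"p1": 0.2, "p2": 0.3},
    "C": {"p1": 0.85, "p2": 0.75},  # the unlabelled instance
}

lines = []
for inst, props in properties.items():
    for prop, strength in props.items():
        lines.append(f"<{{{inst}}} --> [{prop}]>. %{strength:.2f}%")   # Eq. (1)
lines.append("<{A} --> LABEL>.")    # Eq. (2): A and B are the labelled examples
lines.append("<{B} --> LABEL>.")
lines.append("<{C} --> LABEL>?")    # Eq. (3): query whether C receives the label

print("\n".join(lines))             # this text is then fed to the ONA executable
```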

After a grid search we set the number of labelled examples per class to 2 (plus one unlabelled example, giving 3 per class). We note this is the smallest number that allows similarity to be exploited. With this setup, and with both synthetic and real data, we attempted to answer the following questions:

  • RQ1: Can non axiomatic reasoning, which can cope with conflicting information, be leveraged to perform speech recognition?

  • RQ2: Is there a computational advantage?

As baselines we used 1) pre-trained general speech recognition Whisper models, and 2) the earlier mentioned ANAM model. We expected these to produce state-of-the-art results, at the cost of greater computation.
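As an illustration of how such a baseline can be queried, the sketch below uses OpenAI’s `whisper` Python package; the package choice, helper name, and audio path are our assumptions rather than the exact script used here.

```python
# A minimal sketch of a Whisper baseline, assuming OpenAI's `whisper` package.
import string
import whisper

model = whisper.load_model("tiny")            # or "large" for the larger baseline

def transcribe_word(wav_path: str) -> str:
    """Transcribe a 1 s clip; comparison is case insensitive, ignoring punctuation."""
    text = model.transcribe(wav_path)["text"].strip().lower()
    return text.translate(str.maketrans("", "", string.punctuation))

print(transcribe_word("speech_commands/one/example.wav"))   # hypothetical path
```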

Experiment 1 NARS, Computational Complexity. We expressed data in the same manner as above, encoding each real-world utterance as 8000 properties, using the energy in each bin as the strength of a property in Narsese statements. Energy values below 0.5 posed a problem, as they express the absence of a feature in the data; e.g. the word ‘Moo’ should not have a high frequency ‘s’ or ‘t’ sound in it. To enable ONA to track the absence of something, we negate the property name, i.e. [mel16x9] becomes [NOTmel16x9], and subtract the truth value from 1, so that a low truth value of 0.1 becomes 0.9. E.g.

$$\begin{aligned} <\{U_1\} \rightarrow [mel16x9]>. \%0.1\% \end{aligned}$$
(4)

is replaced with

$$\begin{aligned} <\{U_1\} \rightarrow [NOTmel16x9]>. \%0.9\% \end{aligned}$$
(5)

We took 3 random utterances of ‘one’ (from the 3893 available), generated the 8000 values for each, then encoded these as Narsese statements. We then asserted \(<\{U_n\} \rightarrow one>.\) for utterances 1 and 2, and queried ONA with \(<\{U_3\} \rightarrow one>?\) to see whether utterance 3 was recognised as similar to the label. All performance tests used a 64 GB AMD Ryzen 5 3600 6-core processor running Ubuntu (no GPU).
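For concreteness, the following is a minimal sketch of the Experiment 1 encoding just described, assuming librosa for the MEL spectrum (the audio library is not specified here); the helper name and the min-max normalisation are our assumptions, and the textual Narsese follows Eqs. (4) and (5).

```python
# Sketch: encode one padded 1 s utterance as 8000 Narsese property statements.
import librosa
import numpy as np

def utterance_to_narsese(wav_path: str, instance: str) -> list[str]:
    y, sr = librosa.load(wav_path, sr=16000)
    y = np.pad(y, (0, max(0, 16000 - len(y))))[:16000]        # pad/trim to 1 s
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80,
                                         hop_length=160)       # 80 bins, 10 ms hops
    mel = librosa.power_to_db(mel)[:, :100]                    # 80 x 100 = 8000 values
    mel = (mel - mel.min()) / (mel.max() - mel.min() + 1e-9)   # normalise to [0, 1]

    statements = []
    for f in range(80):                                         # frequency bin
        for t in range(100):                                    # time frame
            v = float(mel[f, t])
            if v >= 0.5:
                statements.append(f"<{{{instance}}} --> [mel{f}x{t}]>. %{v:.2f}%")
            else:
                # absence of the feature: negate the property name, as in Eq. (5)
                statements.append(f"<{{{instance}}} --> [NOTmel{f}x{t}]>. %{1 - v:.2f}%")
    return statements
```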

Experiment 2 Nalifier, NARS, Synthetic Data. We take the same method as experiment 1, but this time pass the statements into a Python pre-processor, Nalifier.py [12], which suppresses certain Narsese statements and synthesises others, which are in turn passed into ONA. The Nalifier has several functions. If the statement received is an instance-property statement, e.g. \(<\{utterance_n\} \rightarrow [property_p]>\), it collects all the properties for this new instance and all the properties for all other instances in its memory, and compares them to find the closest. If an instance is found that the current instance is similar to, it synthesises and emits new Narsese. The new Narsese is passed into ONA (or more specifically NAL, the executable of ONA) and interpreted. The success criterion is the same as in experiment 1: we check whether the unknown instance is labelled correctly.
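The sketch below is not the actual Nalifier.py [12]; it is only an illustration, under our own assumptions (the similarity measure, the 0.9 threshold, and the function name observe are ours), of the kind of per-property comparison against every stored instance described above, emitting a synthesised similarity statement when a close match is found.

```python
# Illustrative sketch of a Nalifier-style comparison, run on every new property.
memory: dict[str, dict[str, float]] = {}   # instance -> {property: strength}

def observe(instance: str, prop: str, strength: float) -> str | None:
    """Record one property; return synthesised Narsese if a close match exists."""
    memory.setdefault(instance, {})[prop] = strength
    best, best_score = None, 0.0
    for other, props in memory.items():            # compare against every other instance
        if other == instance:
            continue
        shared = set(props) & set(memory[instance])
        if not shared:
            continue
        # similarity = 1 - mean absolute difference over shared properties
        score = 1.0 - sum(abs(props[p] - memory[instance][p]) for p in shared) / len(shared)
        if score > best_score:
            best, best_score = other, score
    if best is not None and best_score > 0.9:      # threshold is illustrative
        return f"<{{{instance}}} <-> {{{best}}}>. %{best_score:.2f}%"
    return None
```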

Experiment 3 - NUTS. We now introduce ‘NUTS’: raNdom dimensionality redUction non axiomaTic reasoning few Shot learner for perception. NUTS consists of four modules: dimensionality reduction, conversion into Narsese, a Narsese pre-processor (the Nalifier), and Open NARS for Applications (ONA). We used a random projection without sparsity to reduce dimensionality. Specifically, we pass the input 16k samples through MEL encoding, producing 8000 values. These 8000 values are multiplied by a randomly generated \(8000 \times D\) matrix, reducing the dimensions to D. These D values are then used to generate Narsese as before, which is passed into the Nalifier, which filters and generates Narsese, which is passed into ONA. Each class was tested in turn, with the negative classes consisting of the remaining 34 words in the Speech Command dataset. The number of learning examples of each word could be varied, along with ONA’s setting for the size of the AIKR, the size of the reduced dimensionality space, and the number of repeats. The matrix used for reduction was re-generated before each run. Success was measured as the proportion of runs where the correct “is a” relationship is identified.
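A minimal sketch of this reduction step is shown below; the use of numpy, the function name, and the Gaussian entries are our assumptions (the text specifies only a dense random \(8000 \times D\) matrix, re-generated before each run).

```python
# Sketch of the NUTS dimensionality reduction: a dense random projection
# from the 8000 MEL values down to D dimensions.
import numpy as np

def random_projection(mel_values: np.ndarray, d: int, seed: int | None = None) -> np.ndarray:
    """Project the 8000-dimensional MEL vector down to d values."""
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((mel_values.shape[0], d))   # 8000 x D, no sparsity
    return mel_values @ projection                               # D values fed to Narsese

# e.g. reduce one utterance's 8000 values to the best-performing D = 4
reduced = random_projection(np.random.rand(8000), d=4)
```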

5 Results

Experiment 1 NARS, Computational Complexity. As mentioned, the baselines were OpenAI’s Whisper model and Andrade et al.’s ANAM. Whisper was tested on 100 random utterances from each of the 35 words in the Speech Command data set; comparisons were case insensitive and excluded punctuation. As seen in Table 1, the Whisper tiny model took an average of 0.8 s per inference (including encoding) with a performance of 58%. ONA was unable to accurately identify the unknown utterance as being similar to anything in memory. This may have been because the full structure of the speech was not exploited, but we wished to avoid manual feature engineering. Analysis of the derived statements showed that, unexpectedly, the instances were considered similar to the properties rather than to the other instances.

Table 1. Performance & Baselines: Whisper Tiny, Large, and ANAM

Experiment 2 Nalifier, NARS, Synthetic Data. The Nalifier took considerable time to execute: loading and ‘training’ 2 instances with 2000 properties each took 95 min, and loading, encoding and performing inference on a new example took an additional 43 min. This version of the Nalifier contains an \(O(n^2)\) algorithm which executes each time a new property is observed for an instance. After 3 instances, each with 2000 properties, were added into NARS (first pre-processed by the Nalifier), NARS successfully determined that instance C, the unlabelled instance, was similar to instance A. This showed that speech recognition is possible with NARS, and that the Nalifier is needed. We did not attempt 8000 properties, or measure accuracy, due to the execution time.

Experiment 3 NUTS. We were surprised that randomly reducing dimensions worked, even for small numbers of training examples. For these experiments the best performance was obtained with 4 dimensions, where the unknown class was labelled correctly 64% of the time, compared to 2.8% for random performance, see Table 1. This compares favourably with Whisper Tiny’s 58%, but is far below ANAM’s state-of-the-art 94%. Training was label and compute efficient, needing only 2 training samples per class, and inference time was 0.02 s (including encoding and dimensionality reduction), far below experiment 2’s 43 min, showing that the dimensionality reduction is the source of the computational efficiency.

We conclude that perception, specifically speech recognition, is possible with NARS. However, we note performance collapsed for certain words such as ‘bird’ and ‘bed’, yet ANAM’s confusion matrix shows it is possible to distinguish them. This may be due to a limitation of NARS, or to information loss in the dimensionality reduction. Figure 1 shows the overall performance for a randomly generated matrix, a random word, and reduced dimensions 2–10, repeated 3500 times (100 times per class). Figure 2 shows performance increases with the number of examples, rising from 64% at 2 examples to 90% at 20.

Fig. 1. Accuracy as a function of the reduced-dimension embedding. Number of examples per class = 2.

Fig. 2. Accuracy as a function of the number of examples. Reduced dimensions = 4.

6 Discussion

RQ1: Can non axiomatic reasoning, which can cope with conflicting information, be leveraged to perform speech recognition? Yes, we demonstrated that NARS (along with the Nalifier and dimensionality reduction) can interpret speech data in a meaningful way, obtaining 64% accuracy with only 2 training examples on a 35-class problem.

RQ2: Is there a computational advantage? A crude comparison of inference time on CPU suggests the two are of a similar order of magnitude. However, as Whisper runs efficiently on GPU, and NUTS is a mix of C and Python, a strict comparison of counts of each operation type could not be completed in the time available, and is left as further work.

We started by discussing the term ‘Artificial Intelligence’ and resource consumption. While we did not produce a system with the same or better performance using fewer resources, if we had done so, would it be more ‘intelligent’? We would argue it would not, as the catalyst for the improvement is not part of the described algorithm. We suspect building ‘intelligent’ systems (as in ‘thinking machines’) will remain elusive until the terms skill and intelligence are disentangled, and the catalyst for improvement is isolated and automated.