1 Introduction

In the past twenty years, many bio-inspired computing techniques have been developed on toy datasets and applied to small but expensive real datasets. Yet each year our computers get faster, and we have access to more and more data collected from the real world. In the past decade we have seen our access to computational resources increase faster than the data we would like to analyze. An example of this is the success of deep learning models in image processing: the convolutional neural network (CNN) architecture was introduced in 1989 [1], yet it was only from approximately 2006 onwards that its use in image processing became widespread.

This paper reviews a body of work from the past 20 or so years, focused on the initial development of techniques and some examples of their later application to human-derived signals. The literature on such work is huge; this paper concentrates largely on the outputs of the late Information Engineering group at the University of New South Wales in Sydney from the 1990s (mostly bio-inspired computing technique development) and later work, up to the present, in the Human Centered Computing research group at the Australian National University, with applications and some further technique development. The objective is to provide a longitudinal survey which retains some breadth in topic areas as well as depth in some topics, and does so by the artifice of a group-biographic survey. The paper closes with a proposal for automated data analysis which synthesizes the bio-inspired tools discussed into an automated experiment analysis methodology.

2 Bio-inspired Computing Tools

Bio-inspired computing is related to artificial intelligence, machine learning and so on; the key is that the learning algorithms take their inspiration from the world, particularly from biology. In this paper we will concentrate on three main models: neural networks (including deep learning), fuzzy logic, and evolutionary algorithms. We note that the terms “soft computing” and “computational intelligence” are near synonyms of bio-inspired computing. Neural networks are a computational model based on models of neurons in the brain, and the way these simple computing elements are interconnected to produce complex computations. In the literature they are sometimes referred to as artificial neural networks. It is particularly worth noting in the context of neural networks that these are computational (or engineering style) approximations and simplifications which have been found to be useful computationally, but do not attempt to directly model the real behaviour of neurons and biological nervous systems. Neural networks are in general local search techniques. Fuzzy logic can be seen to model linguistic reasoning, with multi-valued set memberships and linguistic variables. Evolutionary algorithms mimic some of the properties of biological evolution to perform computations in a form of randomized but guided global search. The combination of local and global search in hybrid algorithms is slow, but can provide better results than either alone.

In the following sections we briefly review some work in these areas, some of which we believe can and should have application to modern large problems. Our confidence comes from two sources: firstly, the success of the CNN architecture; and secondly, the recent implementation of the “Gedeon method” [2] in the H2O [3] deep learning package – currently still somewhat slow when applied to large datasets, yet its implementation indicates an independent confidence in both the usefulness of the technique and its likely applicability on tomorrow’s faster hardware.

2.1 Neural networks

The back propagation neural network algorithm was introduced in 1986 [4] (see Fig. 1).

The neural network is trained by presenting training patterns at the input. The weights on each link modulate the inputs; these are summed at the hidden neurons and a non-linear activation function is applied (often the sigmoid or logistic function). When values reach the output neurons, each value is compared to the desired value for this training input, and the difference serves as a ‘back propagated’ error signal used to make small modifications to the weights in the preceding layer. These modifications can in turn be used to infer error signals for earlier layers of weights, working back towards the input. After many presentations of the training data, the network approximates the function linking the input to the output. This is an expensive process, particularly if the input is large and the network has many layers. As mentioned earlier, it is only in the last decade that deep learning on large data has become accessible to those without a handy supercomputer.
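As a minimal illustration of this training loop, the sketch below (our own, with a single hidden layer, sigmoid activations, squared error and plain gradient descent; all function and variable names are ours) trains a small network by back propagation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def add_bias(A):
    return np.hstack([A, np.ones((len(A), 1))])

def train_backprop(X, T, n_hidden=8, lr=0.5, epochs=5000, seed=0):
    """Train a one-hidden-layer network on inputs X and targets T by back propagation."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1] + 1, n_hidden))   # input(+bias) -> hidden
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1, T.shape[1]))   # hidden(+bias) -> output
    Xb = add_bias(X)
    for _ in range(epochs):
        H = add_bias(sigmoid(Xb @ W1))                  # hidden activations (+ bias unit)
        Y = sigmoid(H @ W2)                             # network outputs
        err_out = (Y - T) * Y * (1 - Y)                 # output error signal
        err_hid = (err_out @ W2[:-1].T) * H[:, :-1] * (1 - H[:, :-1])  # back propagated error
        W2 -= lr * H.T @ err_out                        # small modifications to output weights
        W1 -= lr * Xb.T @ err_hid                       # ... and to the preceding layer
    return W1, W2

def predict(X, W1, W2):
    return sigmoid(add_bias(sigmoid(add_bias(X) @ W1)) @ W2)

# Example: learn XOR from four training patterns
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_backprop(X, T)
print(predict(X, W1, W2).round(2))
```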

Fig. 1: Simple neural network; multiple layers of hidden neurons are possible. If there are ‘many’, this becomes a deep learning neural network

2.1.1 Gedeon method

Data encoding and feature selection for the training of back-propagation neural networks have two basic principles: i) to avoid encrypting the underlying structure of the data, and ii) to avoid using irrelevant inputs. The paper [2] used weight matrix analysis and functional measures on two noisy real data sets, and introduced a novel aggregation technique:

$$P_{jk} = \frac{\left| w_{jk} \right|}{\sum_{r=1}^{nh} \left| w_{rk} \right|}$$
(1)

where P_{jk} is the contribution of hidden neuron j to output k. This can readily be calculated for the contributions of the inputs to the first hidden layer, or composed backwards through any number of layers, for example for a 2-layer network:

$$Q_{ik} = \sum_{r=1}^{nh} \left( P_{ir} \times P_{rk} \right)$$
(2)
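As we read equations (1) and (2), the contributions are magnitude-normalised weight columns, composed across layers by a matrix product. A minimal sketch (variable names are ours):

```python
import numpy as np

def contributions(W):
    """Equation (1): P[j, k] = |w_jk| / sum_r |w_rk|, column-normalised absolute weights.
    W has shape (n_from, n_to)."""
    A = np.abs(W)
    return A / A.sum(axis=0, keepdims=True)

def input_contributions(W_in_hid, W_hid_out):
    """Equation (2): Q[i, k] = sum_r P_in[i, r] * P_out[r, k] for a 2-layer network."""
    P_in = contributions(W_in_hid)     # input -> hidden contributions
    P_out = contributions(W_hid_out)   # hidden -> output contributions
    return P_in @ P_out                # composed backwards through the layers

# Hypothetical trained weights: 3 inputs, 4 hidden neurons, 2 outputs
rng = np.random.default_rng(1)
Q = input_contributions(rng.normal(size=(3, 4)), rng.normal(size=(4, 2)))
print(Q.sum(axis=0))   # the contributions to each output sum to 1
```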

2.1.2 Bimodal distribution removal

Cleaning noisy training sets can improve generalization, but many methods that perform well on artificially noisy data do less well on real-world data, where distinguishing between rare data points and merely noisy ones can be difficult. A statistically based method [5] performs well on such data and provides a stopping criterion to terminate training (see Fig. 2).
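The following is a sketch of the bimodal distribution removal idea only; the exact pruning schedule, constants and stopping test of [5] are not reproduced here:

```python
import numpy as np

def bdr_prune(errors, alpha=0.5):
    """One pruning step on the per-pattern error distribution: look at the high-error
    tail (patterns above the overall mean error) and drop patterns lying far out in
    that tail. Returns a boolean mask of patterns to KEEP."""
    high = errors > errors.mean()              # the high-error tail of the distribution
    if not high.any():
        return np.ones_like(errors, dtype=bool)
    mu, sigma = errors[high].mean(), errors[high].std()
    return errors < mu + alpha * sigma         # remove patterns far out in the tail

# Pseudo-usage inside a training loop:
#   every so many epochs:  keep = bdr_prune(per_pattern_error); X, T = X[keep], T[keep]
#   stop training once the variance of the high-error subset becomes very small
```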

Fig. 2: Error distribution by input patterns, after a few epochs of training (left) and after 500 epochs (right)

2.1.3 Adding noise

While we have access to more and more data, sometimes we are still short of labeled data: cases where the class label or quality value is assigned by an expensive process such as expert human intervention, or where the collection of the raw data is itself expensive, such as the extraction of petroleum reservoir core samples or values for a geographical information system ‘pixel’. Experiments have shown that a crude form of simulated annealing with decreasing amounts of noise works well [6], and that varying amounts of normally distributed noise also lead to good results [7].
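A minimal sketch of training with decreasing input noise, in the spirit of [6, 7] (the linear decay schedule and noise level are our assumptions):

```python
import numpy as np

def noisy_batches(X, T, epochs, sigma0=0.2, rng=None):
    """Yield the training data with Gaussian input noise whose magnitude decays over
    the epochs -- a crude simulated-annealing-style schedule for small labeled sets."""
    rng = rng or np.random.default_rng(0)
    for epoch in range(epochs):
        sigma = sigma0 * (1.0 - epoch / epochs)        # noise decreases towards zero
        yield X + rng.normal(scale=sigma, size=X.shape), T

# Pseudo-usage:
# for X_noisy, T_epoch in noisy_batches(X, T, epochs=500):
#     ...one epoch of back propagation on (X_noisy, T_epoch)...
```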

2.1.4 Explanations

The use of neural networks for prediction is commonplace, for example in student grade prediction [8], but in domains of human interaction this alone is unsatisfactory. Thus, for example, if we predict the weather tomorrow, we expect this to have limited effect on the weather. Yet if we predict a student’s grade, we expect some action to (attempt to) change that prediction if the student is unhappy with the predicted result; the converse may also occur, when a feeling of complacency leads to less effort and a lower mark. Causal modeling [9] can produce explanations for the neural network’s conclusions, noting that correctness here means matching the network outputs and not the underlying class distribution, a distinction which will become significant later in this paper. We note that the approach was computationally expensive, was applied to clustering-based subsets of the data, and produced rules of the form shown in Fig. 3:

Fig. 3: Explanations for a Distinction grade using the causal index with characteristic patterns

We note that the negative association of h2 with a Distinction grade proved correct. There is recent related work in peer marking [10].

2.1.5 Cascade networks

The cascade correlation network [11] is a powerful training algorithm which constructs networks one neuron-layer at a time. Each new neuron has connections from the inputs and all previous neurons, which in principle allows arbitrarily complex learning in the last neuron. In practice, to keep computational costs manageable, each neuron-layer’s weights are frozen before the next is added. The networks therefore freeze inaccurate early learning and require many layers to unlearn it, causing these networks not to generalize well on regression and some classification problems. An alternative approach, using RPROP to train the networks and low learning rates (to reduce weight updates rather than freeze weights), results in networks which use fewer hidden neurons and generalize better than those produced by the original cascade correlation algorithm [12]. An extension which inserts small ‘towers’ of cascaded neurons as single higher order neurons reduced the computational cost to close to (then) tractability – we are investigating these as deep learning feature extractors for non-visual data (see Fig. 4).
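A structural sketch of the cascade architecture’s forward pass only (not the full cascade correlation training of [11] or the RPROP variant of [12]): each new hidden neuron is connected to the inputs and to all previously installed neurons, whose weights stay frozen:

```python
import numpy as np

def cascade_forward(x, hidden_weights, output_weights):
    """Forward pass through a cascade network: hidden_weights[i] has length
    n_inputs + i + 1 (inputs, all earlier hidden neurons, and a bias)."""
    acts = list(x)                                    # start with the raw inputs
    for w in hidden_weights:
        z = np.dot(w[:-1], acts) + w[-1]              # inputs + all previous neurons + bias
        acts.append(np.tanh(z))                       # install the new neuron's output
    return np.dot(output_weights[:-1], acts) + output_weights[-1]

# Two inputs, three cascaded hidden neurons (weights here are arbitrary placeholders)
rng = np.random.default_rng(0)
hidden = [rng.normal(size=2 + i + 1) for i in range(3)]
out = rng.normal(size=2 + 3 + 1)
print(cascade_forward(np.array([0.5, -1.0]), hidden, out))
```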

Fig. 4: Exponential growth in weights is reduced using towers – higher order neurons

2.1.6 Extreme Learning Machines

The extreme learning machine (ELM) is a method of training shallow feed forward networks with high efficiency [13]. The key notion is that the input weights are not trained; rather, they are set at random and fixed. Then, since the training set and input weight matrix are fixed, the output weights can be calculated directly, generally by the Moore-Penrose pseudo-inverse. The number of hidden neurons needs to be raised significantly, yet a network of 20 hidden neurons trained by back propagation can be replaced by one of 400 neurons with ELM training and still achieve a 10-15 fold increase in speed, with a small drop in accuracy. Each ELM hidden neuron can be considered to come with some initial (random) fixed functionality, and the training of the output weights is a weighted selection from the available menu. A small ELM network could be used as a higher order neuron, and some initial work has been done [14] (see Fig. 5).
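A minimal sketch of ELM training as described above (sizes and names are ours): random fixed input weights, hidden activations computed once, and output weights solved directly with the Moore-Penrose pseudo-inverse:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elm_train(X, T, n_hidden=400, seed=0):
    """Extreme learning machine: random input weights are never trained; the output
    weights are a one-shot least-squares solution via the pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-1, 1, size=(X.shape[1] + 1, n_hidden))   # random and fixed
    H = sigmoid(np.hstack([X, np.ones((len(X), 1))]) @ W_in)     # fixed hidden features
    W_out = np.linalg.pinv(H) @ T                                # Moore-Penrose solution
    return W_in, W_out

def elm_predict(X, W_in, W_out):
    H = sigmoid(np.hstack([X, np.ones((len(X), 1))]) @ W_in)
    return H @ W_out
```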

Fig. 5: Cascade of ELM networks

Such sequential processing may be useful in simpler settings where two outputs are to be predicted, and the prediction of one, when added as an extra input, can improve the prediction of the second [15].

2.2 Fuzzy logic – Fuzzy relational maps, interpolation, hierarchical fuzzy, FOE

The term ‘fuzzy’ is unfortunate; the name was intended to signify a mathematically rigorous method of dealing with uncertainty. Unfortunately, the normal English meaning of the word is close to the opposite of this. Thus, much of the initial success in this field originated in non-English speaking countries, particularly Japan.

Fuzzy logic [16] approximates human linguistic reasoning. When asked to define ‘tall’ we can clearly identify both tall and short people, yet naturally do not consider the two individuals on either side of an arbitrary height to be unambiguously tall or short when their heights may differ by only a few millimeters. Thus, values of set membership between 0 and 1 are used. These latter individuals may have fuzzy set membership values of 0.49 and 0.51 reflecting their similarity in height and retaining the ability to convert to classical sets by the round function (also called defuzzification). The second main component is fuzzy linguistic variables, such as the example of ‘tall’ used already.
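An illustrative membership function for ‘tall’ (the 170 cm and 185 cm breakpoints are purely our example, not from the text):

```python
def tall(height_cm, a=170.0, b=185.0):
    """Fuzzy membership of 'tall': 0 below a, 1 above b, linear in between, so two
    people a few millimetres apart receive almost equal membership values."""
    if height_cm <= a:
        return 0.0
    if height_cm >= b:
        return 1.0
    return (height_cm - a) / (b - a)

print(tall(177.4), tall(177.6))   # ~0.49 vs ~0.51 -- similar heights, similar memberships
print(round(tall(177.6)))         # defuzzify to a crisp set: 1, i.e. 'tall'
```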

As noted in the previous section, neural networks return values between 0 and 1, as do fuzzy set memberships, and so do probabilities; it is important to differentiate between these. A neural network’s outputs are not probabilities unless the network has been explicitly trained with output values representing probabilities [17]. The values between 0 and 1 in fuzzy logic represent vagueness and are possibilistic, unlike probabilistic values between 0 and 1, which model ignorance. A simple dichotomy: probabilities add to 1, while possibilities need not do so, though for convenience many fuzzy algorithms impose the same condition. Figure 6 demonstrates this: we note the example shown has values of 0.4 and 0.15, which add to less than one, while the converse is also possible. A key limitation of fuzzy logic is the exponential growth in the number of rules for full (dense) rule bases. This has limited applications to control settings where there are only a few input parameters but complex behaviour. See equation 3, where k is the number of input dimensions and T is the granularity of the rule base – the number of fuzzy linguistic terms per input dimension:

$$\left| R \right| = O(T^k)$$
(3)

Fig. 6: Traditional (‘crisp’) sets and fuzzy sets

2.2.1 Fuzzy Equivalence

Most fuzzy system and neural network models are universal approximators, in that they can approximate any continuous function to arbitrary accuracy. These techniques therefore share approximation capabilities. Fuzzy systems with if-then rules have the advantage of easy interpretability, and neural networks can adapt their learning to improve performance on a training data set. It has been shown that several fuzzy controllers implement radial basis function neural networks [18], a kind of feed forward network with radial basis activation functions rather than the sigmoid ones described above. Such evidence suggests that the combination of these techniques in hybrid systems will not lead to a loss of approximation ability.

2.2.2 Fuzzy Cognitive Maps

The fuzzy cognitive map (FCM) [19] extends the traditional cognitive map decision-making aid with weighted directed links which represent degrees of causality. The causal links between concepts are straightforward for people to define, but produce static maps; weighted links with a learning regime allow for the modeling of dynamic situations. The learning process uses differential Hebbian learning and event sequences, starting from an initial static map, to learn the weights which best reproduce such training sequences [20].
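A minimal sketch of FCM inference only (not the learning of [20]); the exact update rule varies across the literature, and the weights below are purely illustrative:

```python
import numpy as np

def fcm_step(state, W, squash=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """One fuzzy cognitive map update: each concept's new activation is the squashed
    weighted sum of its causal inputs. W[j, i] is the causal weight from concept j
    to concept i, in [-1, 1]."""
    return squash(state @ W)

def fcm_run(state, W, steps=20):
    """Iterate until (hopefully) a fixed point or limit cycle is reached."""
    for _ in range(steps):
        state = fcm_step(state, W)
    return state

# Three illustrative concepts with assumed causal weights (not from [19, 20])
W = np.array([[0.0, 0.7, -0.4],
              [0.0, 0.0,  0.8],
              [0.0, 0.0,  0.0]])
print(fcm_run(np.array([1.0, 0.0, 0.0]), W))
```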

2.2.3 Qualitative Modeling

Sugeno and Yasukawa’s (SY) qualitative modeling [21] creates a fuzzy rule base (fuzzy IF–THEN rules) from input–output data thus importing one of the benefits of neural networks, and assigns linguistic labels to fuzzy sets in the rule base. Subsequently, some properties such as the construction of trapezoid membership functions, rule projection from training data, selection of important variables and finally parameter identification were improved [22], as was the cluster search for rules projection [23].

One of the advantages of the SY method is that only the required rules are produced, and hence it does not produce a dense rule base, potentially reducing the combinatorial explosion. A concomitant disadvantage is that if there are test samples with values ‘between’ the rules, the result is not guaranteed to be well defined, unless a sparse fuzzy technique such as fuzzy interpolation is used. Another approach to sparse fuzzy system generation is the use of clustering in the output space with projection to each input dimension, producing one-dimensional input clusters which are merged to produce fuzzy rules [24].

2.2.4 Fuzzy Interpolation

Sparse fuzzy rule interpolation techniques are all descendants of Kóczy and Hirota’s linear interpolation technique (KH interpolation, based on α-cuts) [25]. Fuzzy interpolation reduces the computational complexity by reducing T in equation 3. The basic notion of fuzzy interpolation is that antecedents (inputs) lying between the antecedent fuzzy sets of a pair of rules will produce consequents (outputs) lying between the consequent parts of those two fuzzy rules. This is plausible only if we assume there are no discontinuities or sudden changes, such as occur in symbolic spaces. Since we use fuzzy logic to model the real world, which is generally continuous in nature, this is a reasonable assumption (as it is for neural networks and for evolutionary algorithms).
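A sketch of the basic KH idea for triangular fuzzy sets, interpolating the endpoints of the α-cuts at α = 0 and α = 1 by inverse-distance weighting (our own minimal implementation; it does not handle the problematic cases discussed next):

```python
def alpha_cut(tri, alpha):
    """alpha-cut [lower, upper] of a triangular fuzzy set (a, b, c)."""
    a, b, c = tri
    return a + alpha * (b - a), c - alpha * (c - b)

def _interp(x, x1, x2, y1, y2):
    """Inverse-distance weighted interpolation of one alpha-cut endpoint."""
    d1, d2 = abs(x - x1), abs(x - x2)
    if d1 == 0:
        return y1
    if d2 == 0:
        return y2
    return (y1 / d1 + y2 / d2) / (1 / d1 + 1 / d2)

def kh_interpolate(A1, B1, A2, B2, A_obs):
    """Interpolate a conclusion for observation A_obs between rules A1->B1 and
    A2->B2, all triangular (a, b, c), from the alpha = 0 and alpha = 1 cuts."""
    cuts = []
    for alpha in (0.0, 1.0):
        lo, hi = alpha_cut(A_obs, alpha)
        lo1, hi1 = alpha_cut(A1, alpha); lo2, hi2 = alpha_cut(A2, alpha)
        b_lo1, b_hi1 = alpha_cut(B1, alpha); b_lo2, b_hi2 = alpha_cut(B2, alpha)
        cuts.append((_interp(lo, lo1, lo2, b_lo1, b_lo2),
                     _interp(hi, hi1, hi2, b_hi1, b_hi2)))
    (lo0, hi0), (lo1_, hi1_) = cuts
    return lo0, (lo1_ + hi1_) / 2, hi0       # reassemble an approximate triangle

# Illustrative sparse rules (our numbers): 'low' -> 'small' and 'high' -> 'large'
print(kh_interpolate(A1=(0, 1, 2), B1=(0, 1, 2),
                     A2=(8, 9, 10), B2=(8, 9, 10), A_obs=(4, 5, 6)))   # ~ (4, 5, 6)
```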

KH interpolation had its drawbacks: in some cases it could lead to non-interpretable conclusions, due to the way the two core points were interpolated for triangular observations with trapezoidal rules, and the technique required convex and normal fuzzy sets. There have been a number of extensions; we survey some different approaches. The use of a spatial geometric representation avoids the requirement of convex or normal fuzzy sets and guarantees interpretable conclusions in all cases [26] (see Fig. 7). An extension of the latter technique uses the interpolation of the semantics of the rules and their interrelationships to guarantee the direct interpretability of the conclusions, and piecewise linearity for triangular membership functions [27]. Returning to modified α-cut interpolation (MACI) methods, which retain the low computational complexity of the original KH method: retaining a vector description of the fuzzy sets as characteristic points, coordinate transformation, and considering the fuzziness flanking information in the input spaces at the conclusion leads to efficient interpolation of fuzzy rules for multidimensional input variables [28]; and a generalization of characteristic points for different α-cut levels, with normalization and aggregation functions, leads to always acceptable conclusions [29].

Fig. 7: Determination of Ai’’ and B’’ by geometric solid cutting

2.2.5 Hierarchical Fuzzy Systems

Another approach to reducing the exponential growth in number of rules is hierarchical structuring of the rule base so that only some rules are used at a time and the rules used reduce the choices of subsequent rules. That is, in effect reducing k in equation 3.

The first example of a hierarchical fuzzy system is the handcrafted rule base for helicopter flight control [30]. A general problem in replicating that work has been that in many cases the border conditions between hierarchical branches require bridging rules, which can often require many or all variables from both branches and hence obviate the benefit of the hierarchical system. The use of fuzzy relations [31] or co-occurrence relations [32], based on fuzzy tolerance (compatibility) relations and fuzzy similarity (equivalence) relations, allows for extension of search based on hierarchical co-occurrence of words and short phrases in documents. This approach is thus based on the document structure and on the semantic interrelationships of words; equivalent structuring information is not always, or even often, available. Of more general utility, hierarchical rule bases can be constructed directly from data [33]. This technique constructs hierarchical rule bases with a single variable at each level, and then prunes the set of hierarchical fuzzy rules to directly reduce the rule base; see Fig. 8, where 12 rules are reduced to 5.

Fig. 8: Pruning of a simple hierarchical rule base from 12 to 5 rules

When a whole sub-rule base has the same output class (*), it can be pruned by moving the class label up the hierarchy. In some cases (**) a value can be interpolated and also removed.
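A toy sketch of the pruning step for case (*) (our own data structure, not the construction algorithm of [33]): the hierarchical rule base is a nested mapping with one fuzzy term per level, and any subtree whose leaves all carry the same output class is collapsed to that class:

```python
def prune(rules):
    """Collapse any sub-rule base whose leaves all have the same output class."""
    if not isinstance(rules, dict):
        return rules                                 # already a leaf (an output class)
    pruned = {term: prune(sub) for term, sub in rules.items()}
    values = list(pruned.values())
    if all(not isinstance(v, dict) for v in values) and len(set(values)) == 1:
        return values[0]                             # whole subtree agrees: move class up
    return pruned

# Hypothetical 2-level rule base over (x1, x2) with output classes 'A' / 'B'
rules = {"low":  {"low": "A", "med": "A", "high": "A"},   # (*) prunable
         "med":  {"low": "A", "med": "B", "high": "B"},
         "high": {"low": "B", "med": "B", "high": "B"}}   # (*) prunable
print(prune(rules))   # {'low': 'A', 'med': {...}, 'high': 'B'} -- 9 rules become 5
```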

2.2.6 Fuzzy Signatures

Extending fuzzy sets to vector components provides some structuring advantages, but allowing components of the vector to be vectors themselves produces a tree structure, which by its nature gives hierarchical structures [34] that we can build ab initio. Figure 9 shows a possible data structure for a fuzzy signature based on doctors’ assessments.

Fig. 9: A basic fuzzy signature structure A_S and an instantiation A_3. Depending on diagnosis, the nausea element (say) could be a vector of measured values, even hourly

The example in Fig. 9 is illustrative: in a medical ward, instead of the qualitative components, a vector element might be temperature and the values specific measurements. In practice, such signatures are meant to be flexible, so how do we compare patient A_3 in Fig. 9 above with patient A_4 (not shown), whose fever was measured just once? We do this by aggregating the values. Some possible simple functions are the maximum value (so A_3’s fever value is aggregated to 0.8) or the average; in some cases the minimum, and more complex functions, are possible. In general we would like to learn the aggregation functions which best suit the data we have [35], an approach which achieves good results in SARS diagnosis. Beyond the medical domain, fuzzy signatures have been used in robot communication, with the introduction of a codebook to implement implicit fuzzy communication [36].
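A minimal sketch of such aggregation (the nested-list encoding, the membership values and the choice of max are illustrative only; [35] learns the aggregation functions from data):

```python
def flatten(signature, agg=max):
    """Aggregate each component of a fuzzy signature to a single membership value,
    so signatures with different-depth components become directly comparable."""
    def collapse(component):
        if isinstance(component, (int, float)):
            return float(component)
        return agg(collapse(c) for c in component)
    return [collapse(c) for c in signature]

# Patient A3-like signature: [[fever readings over time], nausea, headache]
A3 = [[0.6, 0.8, 0.7], 0.3, 0.5]    # fever measured three times
A4 = [0.8, 0.3, 0.5]                # fever measured once
print(flatten(A3))                  # [0.8, 0.3, 0.5] -- now comparable with A4
```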

2.2.7 Fuzzy Output Error

The final fuzzy approach we will discuss uses the notion of fuzzy values being possibilistic and about uncertainty in learning algorithms. That is, when output values are ‘close enough’ we should not be training to improve those output values, but rather be training those which are not correct at all. This leads to the notion of fuzzy output error [37], which permits choices in the shape of error functions and allows us to tune error functions for specific applications. For example, in the medical domain false negatives (the patient has the condition but the algorithm classifies them as healthy) are very unfortunate for serious conditions.

From Fig. 10 we can see that in general it appears sensible to include the fuzzy values of classification results between 0.5 and 1, rather than just the rounded value. A sensible extension is to learn the shapes of the membership functions from data [38], using squashing functions with sigmoids approximating the fuzzy membership functions, so that gradient descent techniques such as neural networks can be used.
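A sketch of a fuzzy output error function of this kind (the breakpoints are our assumption, not those of [37]): differences smaller than a ‘close enough’ threshold contribute no error, so training effort concentrates on outputs that are not correct at all:

```python
def fuzzy_output_error(output, target, ok=0.4, wrong=0.8):
    """Trapezoidal error membership over the output-target difference: zero up to
    `ok`, full error beyond `wrong`, linear ramp in between (breakpoints assumed)."""
    d = abs(target - output)
    if d <= ok:
        return 0.0
    if d >= wrong:
        return 1.0
    return (d - ok) / (wrong - ok)

print(fuzzy_output_error(0.65, 1.0))   # 'close enough' to the class-1 target -> 0.0
print(fuzzy_output_error(0.10, 1.0))   # not correct at all -> 1.0
```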

Fig. 10: A fuzzy trapezoidal (or triangular) membership function superimposed on the cross diagonal for illustration purposes

2.3 Evolutionary Algorithms and Hybrid Techniques

Evolutionary algorithms use fitness functions to determine the likelihood that an individual in the population of solutions is used to help produce the next cycle of candidate solutions. Operators such as cross-over and mutation mimic biological mechanisms which combine information from two ‘parent’ solutions to create a new candidate solution and introduce diversity (‘mutation’ increases the possible search space, whereas ‘cross-over’ optimizes the search of the space already available to candidate solutions), as sketched below. At a practical level, if we have a set of labelled data we would initially consider a neural network solution, whereas if we have a function to evaluate the quality of solutions we would initially consider an evolutionary algorithm. Of course, functions can be used to generate data, and data sets can be used instead of a function by aggregating the errors and inverting that cost function to serve as a fitness function.
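A generic minimal evolutionary algorithm of the kind just described (not any specific cited system): truncation selection, one-point cross-over and Gaussian mutation over a real-valued chromosome:

```python
import random

def evolve(fitness, n_genes, pop_size=50, generations=200, p_mut=0.05, seed=0):
    """Evolve a population of real-valued chromosomes towards higher fitness."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]  # fitter half survives
        pop = parents[:]
        while len(pop) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)              # one-point cross-over
            child = a[:cut] + b[cut:]
            child = [g + rng.gauss(0, 0.1) if rng.random() < p_mut else g
                     for g in child]                     # mutation injects diversity
            pop.append(child)
    return max(pop, key=fitness)

# Example: maximise a simple function of a 5-gene chromosome
best = evolve(lambda c: -sum((g - 0.3) ** 2 for g in c), n_genes=5)
print(best)
```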

In principle, the individual elements of candidate solutions which together form the solution (and form the chromosome) should be independent and continuous, though in practice evolutionary algorithms are sufficiently robust to provide good solutions even without independence or continuity [40]. For example, timetabling is inherently a dependent problem, as a change in one timeslot inevitably causes changes to some or all of the other timeslots [41]. In such situations clustering of the data [42] is useful to create independent groups of points. A survey can be found in [43]. Hybrids of neural networks and evolutionary algorithms were mentioned earlier; other hybrids are possible, for example fuzzy and evolutionary [44], or ensembles [45].

3 Human Behaviour

The significance of human centered computing research rests on its ability to be useful in human contexts – to extract humanly useful information resources and use these to enhance computer and software interactions with human beings. We review a few approaches in this vein, including automatic camera control based on human behaviour, face recognition, reading on large and small screens, and recognizing gendered differences in behaviour. The latter serves as a proxy for all of the other possible differences (education, culture, first language and so on) which lead to the need to consider sub-groups of people in such research.

3.1 Camera Control

The complexity of dealing with human beings in experiments is well illustrated by the study in which it was shown that the sequence of experiments has an effect on the results [46]. Normally, the purpose of multiple sequences is to determine whether there is a ‘first seen’ issue which benefits that technique. Subsequent work showed improvements in pan-tilt-zoom camera control from a natural combination of head movements and eye gaze [47], leading to a simple model where automatic eye movements of the operator are used to control the camera [48], or using mixed media [49]. These studies illustrate that application of advanced processing is best done behind the scenes and to aid users rather than force them to make choices or to learn new behaviors. With the computational power at our disposal and the bio-inspired computing tools previewed earlier, this could be done more broadly.

3.2 Face Recognition

For people, faces are important; thus a survey of this nature, no matter how brief, would be remiss without considering computational approaches. There are two aspects: the first is the recognition of faces varying in pose, illumination and expression [50]. The next steps are expression recognition and emotion recognition (including by other techniques [51], and of emotional state [52]), followed by group affect [53].

3.3 Reading and eLearning

Reading is a learnt behaviour and does not come naturally to humans. Nevertheless, due to the substantial amounts of time spent learning to read and practicing reading, it becomes close to a natural human behaviour for a large part of the world’s population, except for some with special needs [54]. For business, each group of people can have specific needs to address [55].

During reading, we can use bio-inspired and information retrieval tools to discover information on the texts being read by observing and analyzing the readers’ behaviour. We can extract the stress level embedded in the text scenario [56], the reading comprehension of the reader [57], and the document category [58].

On small screens, we can discriminate search behaviour [59] and mobile learning preferences [60]. For eLearning in general, we can discover how the mode of presentation of the texts read affects learning outcomes [61].

3.4 Gender

We can recognize differences in behaviour in reading by gender [62], and differentiate English as a first language readers from second or later language readers [61]. We can differentiate responses to face replacement in videos [63]. Such evidence supports our contention that results from human studies are contextual and we must use the large volumes of data available to ensure we cover the breadth of possible populations rather than collect ever more data from the same populations. As mentioned earlier, we consider the work on gender to be a general proxy for such considerations.

3.5 Stress

Stress is a major bane of modern society, so the detection of stress warrants its own section. The literature is huge; we will mention a particular approach focused on the observers of stress. This makes those techniques suitable for the most common work and play setting of this era – looking at things on screens. We illustrate with an example using information not readily available to fellow humans, namely high resolution thermal images [64, 65], see Fig. 11.

Fig. 11: Thermal video of experimental subject watching stressful / calm videos

4 Synthesis – Deep Learning for Experiment Synchronization

Data structuring and synchronisation, a synthesis and proposal: we wish to record vast quantities of data from sensors attached to or pointed at human participants in experiments, and we need to be able to reason and make comparisons between data recorded for different people using different sensors, in somewhat different settings. (Humans do this all the time!) We can use our event signatures (an extension of the fuzzy signatures described earlier) to record rich metadata along with the sensor recordings.

The plethora of individual devices means strong synchronisation signals are not possible with all, or even the majority, of devices. We can address this problem in two ways: simply storing sensor device times in the event signatures, as differences in device time stamps are likely to persist; or, our innovative solution, automated synchronisation and data alignment using deep learning with convolutional neural networks. Deep learning has shown great success in image analysis, devised originally for optical digit recognition where the 2D structure of the data is known (digits can be translated, and even rotated somewhat, and still remain recognisable to humans and to convolutional neural networks). With our time sequence sensor data, we know which signals are from the same experiment and will always roughly know the time, so the task is to refine that knowledge (see Fig. 12).

Fig. 12: Sketch of CNN: inputs are sensor signals, outputs are offset vectors
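One way the proposed network could be sketched (an assumption on our part, not an implemented system): a 1D convolutional network over a window of roughly aligned sensor channels, regressing an offset (in samples) per sensor relative to the first; training pairs could be generated by artificially shifting already-aligned signals:

```python
import torch
import torch.nn as nn

class OffsetNet(nn.Module):
    """Hypothetical realisation of the Fig. 12 sketch: convolutional feature
    extraction over multi-channel sensor windows, then a linear head that
    outputs one offset per sensor relative to the first."""
    def __init__(self, n_sensors):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_sensors, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
        )
        self.head = nn.Linear(32 * 8, n_sensors - 1)   # the offset vector

    def forward(self, x):                 # x: (batch, n_sensors, window_length)
        return self.head(self.features(x).flatten(1))

model = OffsetNet(n_sensors=3)
offsets = model(torch.randn(2, 3, 1024))  # -> shape (2, 2): predicted offsets
```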

We use the deep learning approach to discover the best possible match between sensor signals across the various sensors. We provide two simple examples here: 1) frontal EEG electrode signals always contain some noise to be filtered out, caused by signals from the eye muscles during saccades; this noise is well correlated with eye gaze data, corresponding to the between-fixation times of that signal [66], and the patterns of lengths of saccades can accurately synchronise these two signals. 2) The effect of breathing on heart rate is detectable from the heart rate signal, and can be matched to warmer pixels near the nostrils/mouth [67], thus synchronising those sensors. Risks, caveats and counters: a) we note that this adaptive synchronisation will not be millisecond precise, but neither are human reactions; b) it uses the overt surface statistical properties of the signals (such as long-range correlations in the signals; we note that the specific reactions within our experiments will not have the same long range correlations due to order balancing of the experiment; we gave an example using the noise from EEG to correlate with the least interesting part of the eye gaze signal); and c) this synchronisation is testable, as we know the different timings of skin conductance reactions as compared to fNIRS or EEG reactions to the same event, so we would be able to calibrate our adaptive synchronisation.

5 Conclusion and Future Work

We have described a longitudinal body of work in the construction of bio-inspired tools, increasingly focused on applications to signals we can directly record or extract from human behaviour, and closed with a synthesis and proposal. A follow-up report on advances in human centered computing over the past few years will be our next project.