
1 Introduction

In recent times, inclusive education has become a primary worldwide concern. Researchers across the globe are working to provide teachers, support staff, and educators with tool-sets that support the assessment and education of children with special educational needs (SEN), as part of a combined approach to inclusive education, by guiding which pedagogical methodologies are most appropriate for each child depending on their needs. By improving the pedagogical support for these students, they will have an increased chance of inclusion in mainstream classrooms or of success in special schools.

Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects communication and behaviour and can be diagnosed at any stage of life. There is no cure for ASD, but following a diagnosis, early detection of dysregulation events and early intervention may help to defuse difficult situations in the classroom or at home. With the increasing prevalence of ASD, early detection and possible intervention have become an important challenge [28]. Recently, AI and machine learning (ML) have played an increasingly dominant role in ASD detection, supporting co-curricular psychology studies. The works in [7, 16, 25, 28] used ML models and performed in silico experiments to simplify and assist conventional clinical experiments in an optimized way.

Besides work on SEN students themselves, research across the globe has recently focused on the ability of children with learning difficulties to recognize [6, 33], perceive [31, 38] and interpret [22, 37] emotional cues. Research on attention detection and on recognising the emotional state of SEN students therefore remains very open. Although ML is used in many cases to develop supportive tools for educators and SEN students, achieving higher performance in this direction is still a challenge. Artificial intelligence and machine learning models are becoming ever more complex, moving from shallow to deep learning over time. Several ML algorithms have been proposed for attention detection, e.g. [4, 11, 34, 36], but they consider unimodal data. To date, very few machine learning-based methods that consider multiple modalities have been developed for multimodal fusion tasks.

Identification of attention for an individual is challenging and involves multiple factors [8, 48]. Using deep learning models, we can achieve higher accuracy and greater precision. However, this also tends to make these models ‘black boxes’, reducing the comprehensibility of the logic behind the various predictions and outcomes. This raises an obvious question: how do we understand the predictions suggested or recommended by these machine learning models so that we can place trust in them? Explainable Artificial Intelligence (XAI) [5] attempts to make a trade-off between precision, accuracy and interpretability to achieve this. In this work, we present an XAI ML approach with multimodal data for attention detection.

Fig. 1.

The figure shows a seek-X quiz: for a given cue, the correct answer must be found among the incorrect ones. During the experimental setup, the participants were asked to find, or seek, the target object among different non-target objects acting as a matrix of noise.

2 Literature Review

ML has become one of the most integrated parts of the research landscape, playing a role in many fields, from genomics analysis [18, 39], image processing [15, 17], text processing [14, 24], trust management [30], prediction models [26, 41] and health care [29, 35] to a growing list of many more. Multimodal Machine Learning (MML) is an emerging multi-disciplinary research domain that extends the original goals of ML-inspired AI by combining multiple complementary and communicative modalities, including vision, text, image, and many more [32].

MML models deal with heterogeneous types of data, which brings added challenges: coping with the different modalities, extracting data and developing knowledge from it. The process comprises the separate stages of representation, translation, alignment, fusion and co-learning, which is in itself a complex research area. Representation is the study of how to represent and summarise multimodal data, which could be complementary or redundant between multiple modalities. Translation is the stage where acquired data is mapped from one modality to another; due to the heterogeneity of data, this relationship between the modalities is a significant challenge. Alignment is the identification of the relations between multiple modalities. The next step is fusion, where information is joined from multiple modalities to make a prediction, classification or recommendation. Finally, co-learning is the stage where knowledge is transferred between modalities, their representations, and their predictive models [3].

To support practice in academia and in various special-needs social settings, there is growing interest in AI embedded in non-autonomous systems that augment human cognition and enhance the capabilities of learners, support staff and teachers. This differs significantly from approaches that aim to create fully automated AI systems. MML and its analytics aim to create AI by externalising and replicating human cognition, and by designing artefacts closely linked with humans to increase their cognitive abilities and improve their overall capabilities [9, 10, 13].

In one study, Hilbert et al. (2017) used machine learning on multimodal biobehavioural data to classify subjects according to the presence of generalised anxiety disorder (GAD) as distinct from mental disorder (MD), combining cortisol data, clinical questionnaire data and structural MRI data with MML [21]. In another study, [47] used MML for automated International Classification of Diseases (ICD) coding, where ICD coding is widely adopted by physicians and other health care workers. Another study by [45] used MML for automatic behaviour analysis to augment clinical resources in diagnosing and treating patients with mental health disorders. In a more recent study, [46] used a multimodal AI-based framework to monitor individuals’ working behaviour and stress levels. Identification of this behaviour and these stress levels can be achieved with higher precision by fusing multiple modalities obtained from an individual’s behavioural patterns. They used a methodology to determine stress due to workload by integrating heterogeneous sensor data streams, including heart rate, posture, facial expressions and computer interaction.

Early identification can notably improve the prognosis of children with ASD. Yet, existing identification models are expensive, time-consuming, and mostly depend on the manual judgement of experts [12, 43]. A multimodal framework that can fuse data on a child’s eye movement and facial expression can help identify children with ASD and improve identification efficiency and explainability. Various ML models, the data types and modalities used, and their performance for attention detection are summarised in Table 1.

Table 1. Various ML models, the data types and modalities used, and their performance for attention detection.

3 Methodology and Data Sets

3.1 Data Collection

A child’s level of attention can be assessed using mobile devices in a non-intrusive manner. We can observe and record their body posture, facial expression, eye gaze, brain activity (EEG), thermal data, and gestures as forms of data. These data can be collected via different sensors, some wearable and some wirelessly connected. A mobile device on which the child is playing a game can thus be used for a continuous performance test (CPT). The platform tracked students’ engagement, performance and attention with a range of sensors. Head tracking and hand tracking from a RealSense camera were combined with head tracking data from a Tobii 4C sensor. Body positioning was tracked from the combined posture tracking and gesture tracking data from the mobile device’s motion sensors. The RealSense camera and the Tobii 4C sensor monitored facial features and eye gaze. A Muse headband (in a child-friendly design) was connected wirelessly over Bluetooth and streamed brain activity data. Figure 2 (left) shows the cartoon of target images used in a ‘Where’s Wally’ game, a seek-X type game in which the challenge is to spot Wally, a specified character. Figure 2 (right) shows the multimodal fusion of the data obtained from the different sensors and their labelling. There were 2615 samples obtained from 59 sessions involving 4 participants. Figure 3 shows the basic multimodal data flow of the evaluation technique. Participants were instructed to find Wally in the seek-X type game. As part of the CPT experiment, different sensors collected multimodal data, such as eye tracking, facial expression and others. After labelling the data as in [8], we used our XAI model for attention detection. A detailed explanation of the experimental setup is available in [8].
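As an illustration of how such heterogeneous streams can be brought together, the following Python sketch time-aligns toy sensor streams with a nearest-neighbour join. The column names, sampling rates and alignment tolerance are assumptions for illustration only and do not reproduce the actual pipeline of [8].

import numpy as np
import pandas as pd

# Toy stand-ins for three sensor streams, each with its own sampling rate;
# the column names are illustrative, not taken from [8].
eye = pd.DataFrame({"timestamp": np.arange(0, 10, 0.1),
                    "eye_dwell": np.random.rand(100)})
face = pd.DataFrame({"timestamp": np.arange(0, 10, 0.25),
                     "smile": np.random.rand(40)})
labels = pd.DataFrame({"timestamp": np.arange(0, 10, 1.0),
                       "attention": np.random.randint(0, 2, 10)})

# Align everything to the eye-tracking clock with a nearest-neighbour join,
# tolerating small offsets between the sensor clocks.
fused = eye.sort_values("timestamp")
for stream in (face, labels):
    fused = pd.merge_asof(fused, stream.sort_values("timestamp"),
                          on="timestamp", direction="nearest", tolerance=0.5)

fused = fused.dropna()   # keep only frames where every modality is present
print(fused.head())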

Fig. 2.

The figure on the left shows the cartoon of target images that were used in a ‘Where’s Wally’ game. The challenge was to spot Wally in a seek-X type game. The figure on the right shows the multimodal fusion of data obtained from different sensors and their labelling. A detailed explanation is available in [8].

Fig. 3.

The figure shows the basic data flow diagram. Participants were instructed to find Wally in the seek-X type game. As part of the CPT experiment, different sensors collected multimodal data such as eye tracking, facial expression, and others. After labelling the data as in [8], we used our XAI model for attention detection.

‘Engagement is the single best predictor of learning in students with learning disabilities’ [23]. In Swanson’s CPT experiment [44], the participant needs to pay continuous attention to a display screen in an interactive way, where a game provides them with a pre-defined signal detection challenge. We refer to this CPT as a ‘seek-X type’ game, as in [8], and use it to label the multi-sensor data. During the experiment, the participants were asked to find the predefined target object among other non-target objects acting as a matrix of noise, like in a ‘Where’s Wally’ game. The challenge is to spot Wally in a grid of characters displayed on the screen; the size of this grid can be varied. The CPT outcome measures and labels these multimodal data (facial expression, eye gaze, body posture) into high and low attention regions. This provides the labels by which we can assess engagement in the live system.

At the data level, information is highly abstract and the main focus of data fusion is noise reduction and compression. At this level, raw data is processed, and data fusion provides an opportunity for data reduction through data correlations and redundancies. At the feature level, the data has already been processed and the features have been extracted; the fusion is applied to the features themselves rather than the raw data. At the decision level, the data is highly semantic and clear temporal behaviours can be seen in the data. A further detailed explanation of the data preprocessing and fusion is available in [8]. Data frames from these three levels of abstraction, with their corresponding CPT attention level labels, are used as input into the machine learning layer.
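As a minimal illustration of feature-level fusion, the sketch below simply concatenates per-frame feature matrices from three modalities into one input matrix for the classifier; the feature counts and names are assumptions for illustration, not the exact representation used in [8].

import numpy as np

# Placeholder per-frame feature matrices, one per modality (2615 frames each)
eye_features = np.random.rand(2615, 4)      # e.g. dwell, blink, squint, gaze
face_features = np.random.rand(2615, 3)     # e.g. smile, frown, head_tilt
posture_features = np.random.rand(2615, 2)  # e.g. lean, ppi

# Feature-level fusion: the extracted features are concatenated into one
# vector per frame before being passed to the classifier.
X = np.hstack([eye_features, face_features, posture_features])
print(X.shape)   # (2615, 9)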

Fig. 4.

The figure shows a decision tree in which the root node (layer 1) contains all the instances in mixed form. It then splits into two branches according to a predictor variable, also known as the splitting variable, which separates the left child node from the right child node. For a splitting variable, the split criterion depends on a score such as the Gini Index or entropy.

3.2 Decision Tree

The decision tree, a machine learning model, is commonly used in ML, data science and related domains to construct classifiers based on multiple features or to build prediction algorithms for given target variables. If a data set has a mixture of continuous, categorical, and binary types, we can use a decision tree algorithm for better prediction. The decision tree asks yes/no type questions and takes decisions accordingly. This model classifies a given population into branch-like segments, constructing an upside-down tree with multiple levels: the root node at the top level, internal nodes at the intermediate levels, and leaf nodes at the bottom. This ML algorithm is a non-parametric model where no parameter tuning is required beforehand, and it can efficiently deal with a large volume of data. The mathematical formulation is also simple and does not impose a complicated parametric structure. Two branches from a parent node are constructed based on the similarity of the data for a given feature, where impurities are calculated by entropy or the Gini index. Figure 4 shows a decision tree. During the development of ML models, the data can be divided into two categories: the first segment is the training set, and the second is the testing set. A 75%/25% or 80%/20% train/test split is a good choice, and k-fold cross-validation is also widely used in the research community for decision trees; leave-one-out cross-validation, however, can be a poor choice if the data set is huge. We use the training set to construct a decision tree and the test set to evaluate its performance and obtain the final optimal model [40, 42]. We can calculate the accuracy of the decision tree's predictions by Eq. 1, where TP indicates true positives, FP false positives, TN true negatives, and FN false negatives:

$$\begin{aligned} Accuracy = \frac{TP+TN}{TP+FP+TN+FN}. \end{aligned}$$
(1)
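The following sketch illustrates this train/test protocol with scikit-learn, using placeholder arrays standing in for the fused multimodal features and the binary CPT attention labels; it is an illustrative sketch, not the exact experimental code of this study.

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholders standing in for the fused multimodal features and the
# high/low CPT attention labels (the real study uses the sensor data of [8]).
X = np.random.rand(2615, 9)
y = np.random.randint(0, 2, size=2615)

# 75%/25% train/test split, as discussed above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Accuracy as in Eq. 1: (TP + TN) / (TP + FP + TN + FN)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# k-fold cross-validation, also mentioned above
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())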

3.3 Gini Index for Decision Tree

Impurities in a decision tree are calculated by the Gini Index (GI), also known as Gini impurity. For a specific feature, the GI calculates the probability of a randomly selected sample being classified incorrectly. If all the elements or samples in a single class are of a similar type, then this class can be termed pure. The GI varies in the range between 0 and 1, where 0 expresses a pure class obtained from a classification, such that all the elements belong to a specific class, whereas a GI score approaching 1 indicates an absolutely impure, random distribution of elements (for a binary classification the maximum is 0.5). A GI value in between indicates a nearly equal distribution of samples or elements over some classes. During the modelling phase of the decision tree, the feature providing the lowest GI value is preferred. The GI can be calculated by Eq. 2, which subtracts the sum of the squared probabilities of each class from one. Mathematically,

$$\begin{aligned} GI = 1 - \sum _{i=1}^{n} \left( P_i\right) ^2 \end{aligned}$$
(2)

where \(P_i\) represents the probability of a sample being classified for a distinct class.
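A direct implementation of Eq. 2 for the labels reaching a node might look as follows; this is a minimal sketch not tied to the study's code.

import numpy as np

def gini_index(labels):
    """GI = 1 - sum_i (P_i)^2 over the classes present at a node (Eq. 2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index([1, 1, 1, 1]))   # 0.0 -> pure node
print(gini_index([0, 1, 0, 1]))   # 0.5 -> maximally mixed binary node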

Fig. 5.

The figure shows the decision tree for the CPT of attention detection from multimodal multi-sensor data, up to layer 6. This is an explainable approach, and we can easily explain the process of decision making. Due to the size of the tree and the given page size, the outcome is not readable here, but a higher-resolution picture is easily readable.

Fig. 6.

The figure shows the decision tree for CPT of attention detection from multimodal multi-sensor data up to layer 4. This is an explainable approach, and we can easily explain the process of decision making.

Fig. 7.

To get better visualization, we pruned the number of layers in this figure. From the figure we can see that at the root node, eye dwelling is the feature that best splits the attention and non-attention classes of the data, using a threshold value of 22.859. The GI score here is 0.259. From the root node, we get two classes. At level 1, the left node contains 4906 samples and the right node 4733. For the left node of level 1, the threshold value of eye dwelling is 19.767, which splits the 4906 samples into two further classes with 640 (left) and 4266 (right) samples and a GI score of 0.227. In the third node of level 2, we can see that the GI score is 0.454, which means that attentive and non-attentive classes are still mixed together there.

4 Result Analysis

From the figure we can see that the root node starts with 9639 samples across the two classes, with a Gini Index of 0.259. This is a categorical tree, where a lower GI represents a better split. Figures 5 and 6 show the full splitting mechanism used to split the data and to trace the decision-making process for attention detection. Due to the number of levels of the tree, they might not be readable here, but a higher-resolution image shows the full scenario. To get a better understanding, we pruned the number of layers in Fig. 7, considering just four levels. As the figure shows, at the root node eye dwelling is the feature that best splits the attention and non-attention classes of the data, using a threshold value of 22.859. The GI score here is 0.259, which is not a pure class: there are similarities within the chosen class, but some impurities remain. From the root node (level 0) we get two classes. At level 1, the left node contains 4906 samples and the right node 4733. For the left node of level 1, the threshold value of eye dwelling is 19.767, which splits the 4906 samples into two further classes with 640 (left) and 4266 (right) samples and a GI score of 0.227. In the third node of level 2, we can see that the GI score is 0.454, which means that attentive and non-attentive classes are still mixed together there. In all of these nodes, all the other features of the data (eye blink, squint, eye gaze inward and outward, facial feature smile, frown, head tilt and ppi) were evaluated and their resulting GI calculated; however, the decision tree shows that the feature that gave us the best results in terms of GI score is eye dwelling.
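The split features, thresholds, GI scores and sample counts discussed above can be read back from a fitted scikit-learn tree as sketched below; this continues the illustrative sketch after Eq. 1 (the fitted tree clf), and the feature names are assumptions for illustration only.

from sklearn.tree import export_text

# Illustrative feature names for the nine fused columns used in the sketch
feature_names = ["eye_dwell", "eye_blink", "squint", "gaze_in", "gaze_out",
                 "smile", "frown", "head_tilt", "ppi"]

# Text rendering of the whole tree: one line per split with threshold values
print(export_text(clf, feature_names=feature_names, decimals=3))

# Root-node details: splitting feature, threshold, impurity and sample count
tree = clf.tree_
print("root split:", feature_names[tree.feature[0]],
      "threshold:", round(float(tree.threshold[0]), 3),  # e.g. eye dwelling <= 22.859
      "gini:", round(float(tree.impurity[0]), 3),        # e.g. 0.259
      "samples:", tree.n_node_samples[0])                # e.g. 9639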

Figure 8 shows the comparative performance. In this figure, the bar graph on the left shows the accuracy of attention detection using our XAI decision tree model for individual modalities. Here, the performance for attention detection is shown considering only eye blink, squint, eye gaze inward and outward, facial feature smile, frown, head tilt and ppi as unimodal features. The bar graph on the right of Fig. 8 shows the comparative performance of our XAI model against different existing ML approaches. The performance of our model is not the best, but it comes from an explainable ML algorithm, the decision tree.
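The unimodal comparison in the left panel can be reproduced in outline as follows: one decision tree is trained per single feature and compared with a tree trained on all fused features. The placeholder data and feature names below are assumptions for illustration, not the study's actual results.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder feature matrix, one column per modality feature, and CPT labels
feature_names = ["eye_dwell", "eye_blink", "squint", "gaze_in", "gaze_out",
                 "smile", "frown", "head_tilt", "ppi"]
X = np.random.rand(2615, len(feature_names))
y = np.random.randint(0, 2, size=2615)

# One decision tree per single (unimodal) feature
for i, name in enumerate(feature_names):
    score = cross_val_score(DecisionTreeClassifier(random_state=42),
                            X[:, [i]], y, cv=5).mean()
    print(f"{name:>10s}: {score:.3f}")

# Fused multimodal model for comparison
print("multimodal:", cross_val_score(DecisionTreeClassifier(random_state=42),
                                     X, y, cv=5).mean())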

Fig. 8.

The figure on the left shows the accuracy of attention detection using an XAI decision tree model for individual modalities. Here we considered eye blink, squint, eye gaze inward and outward, facial feature smile, frown, head tilt and ppi as unimodal features. The figure on the right shows the comparative performance of our XAI model against different existing ML approaches. The performance of our model is not the best, but it comes from an explainable approach. However, as these works used different datasets, the results may also vary, as mentioned by [2].

5 Conclusion

In this research work, we presented decision trees from an XAI model for a continuous performance test, obtained by monitoring multi-sensor data with multimodal machine learning, for engagement analysis. We considered body pose, eye gaze, interaction data and facial features, with objective labelling of engagement or disengagement for cognitive attention during a seek-X type task. We used decision trees, an XAI algorithm, to visualize the decision process over the multi-sensor multimodal data, which helps us assess the accuracy of the model intuitively and provides explainability of engagement or disengagement for visual interactions. The accuracy of the model does not give the best possible results, but it supports decision making, and it is important that this model is more explainable than the black-box algorithms of machine learning. As engagement is the single best predictor of learning in students with learning disabilities, we believe an explainable model for engagement analysis will help to develop a tool useful in inclusive education by assisting teachers, support staff and educators with the assessment of children with SEN.