1 Introduction

Intelligent tutoring systems (ITS) help students improve their learning process (McCartin-Lim et al. 2018). These systems have been used in different teaching areas such as mathematics (Griffith and Griffith 2017), medicine (Suebnukarn and Haddawy 2004), electronics (Graesser et al. 2018), and natural languages (Ghali et al. 2018), among others. It has been shown that using emotional classifiers in ITS to determine the emotional state of students helps the teaching process (Linnenbrink-Garcia and Pekrun 2011). ITS use decision-making algorithms, and one of the most common is fuzzy logic (Fahmi 2018; Fahmi et al. 2018; Fahmi and Amin 2019). Human emotion detection is important in implementing decision making in ITS. Human emotions were first classified by Ekman (1992), who categorized them as basic human emotions. Human emotion recognition is the process of predicting emotions from physical expressions such as facial expressions or brain signals, characterized by features that allow us to differentiate one emotion from another. For example, in speech, we have features such as the tone of the voice; in body expressions, there are features like the position and movements of the body. Even heartbeats and brain signals carry features that express emotions, but their detection requires special and invasive devices (Piho and Tjahjadi 2018). The face is one of the most expressive parts of the human being, since it is our primary channel of communication for expressing emotions.

The recognition of emotions through the face has been one of the most addressed topics in affective computing (Zeng et al. 2009; Calvo and D’Mello 2010; Shan et al. 2009); it is usually referred to as facial expression recognition. Within a learning environment, the most important emotions to handle are those related to the teaching–learning process (e.g., bored, frustrated, confused, and engaged). In research work, the recognition of emotions has been carried out using different classification techniques such as artificial neural networks (ANN), support vector machines (SVM), and Naïve Bayes (Chakraborty and Kopparapu 2016; Bhakre and Bang 2016). However, in recent years, convolutional neural networks (CNN) have proven very successful (González-Hernández et al. 2018; Kumar et al. 2017). One problem with CNNs, however, is tuning a large number of hyperparameters. This process is commonly carried out through trial and error or based on previous, similar works. Due to the high dimensionality of the hyperparameter space, it is difficult to reach a quasi-optimal topology that ensures improved recognition accuracy.

The main novelty and contribution of this paper is the implementation of a methodology that improves the recognition rate of learning-centered emotions (facial expressions) by means of a genetic algorithm for the optimization of hyperparameters in a CNN. The emotion recognizer was adapted to an intelligent learning environment (ILE) called Multi-Sensei, which uses cognitive and affective variables in its adaptive learning process. The ILE and the emotion recognizer run on tablets and cell phones.

This paper is structured as follows: Sect. 2 reviews related work on facial expression recognition, convolutional neural networks, and affective tutoring systems. Section 3 presents the work done to create an emotional corpus, optimize the CNN, implement the CNN on Android mobiles, and integrate the CNN into an intelligent tutoring system. Section 4 presents the results and discussion from testing the optimized CNN, and finally Sect. 5 presents conclusions and future work.

2 Related work

2.1 Facial expression recognition

There are different techniques for recognizing facial expressions. Techniques based on the appearance or texture of the face apply operators and filters to the image to obtain a set of features that are representative of the face. Local binary patterns (LBP) is a method of this type: it takes a grayscale image, divides it into different areas, and applies an LBP operator to each area to obtain a frequency histogram. Several works with important results have been implemented based on this method (Zhang et al. 2016; Parkkinen et al. 2016; Xu et al. 2015).

Techniques based on geometric distances work with key points of the face. These points can be located using expression templates, action units of the facial action coding system, or distance training from examples obtained from facial expression corpora. There are also several important works on facial expression recognition based on geometric distances, with excellent results. In Pu et al. (2015), the authors present a facial recognizer for image sequences that uses a twofold random forest (TRF) classifier for Ekman’s basic emotions; the classifier analyzes facial expressions by means of their action units. In Ghayoumi and Bansal (2016), the authors propose a method that unifies three different techniques: facial action units, geometric features, and graph-based modeling. Salmam et al. (2016) measure the distances of six geometric units across the face, using three distance metrics: Euclidean, Manhattan, and Minkowski. The JAFFE and Cohn-Kanade datasets were used to train the classifier.

In recent years, CNNs have been used for the recognition of facial expressions. In Yu and Zhang (2015), a method for the recognition of static facial expressions is presented; the authors use three techniques to detect faces in the static facial expressions in the wild (SFEW 2.0) dataset: joint cascade detection and alignment, deep CNN, and mixtures of trees. In addition, a five-layer CNN architecture was proposed, but instead of adding a pooling layer at each connection between convolutional layers, a stochastic pooling layer is used. The results had an accuracy of 83% for the happy class, 73% for the surprise class, and below 70% for the rest of the classes. In Ding et al. (2016), FaceNet2ExpNet is proposed, a convolutional neural network designed to learn basic emotions from static images in small databases. The authors tested their architecture by observing the visualizations generated during training and found that the layers maintained high-level information about the faces and emotions being classified. The evaluations obtained predictions from 90 to 100% on the CK+ dataset, from 75 to 94% on Oulu-CASIA, from 83 to 94% on PDT, and from 10 to 100% on SFEW. In González-Hernández et al. (2018), the authors tested two CNNs to classify basic emotions. The first CNN explored the impact of reducing the number of deep learning layers, while the second CNN divided the input images horizontally into two parts based on the positions of the mouth and eyes. Training and testing were performed on the Karolinska directed emotional faces (KDEF) dataset. The first architecture obtained an accuracy of 96.63%, while the second obtained an accuracy of 86.73%. In Burkert et al. (2015), a CNN architecture consisting of four main processes was proposed. The first process preprocessed the data automatically. The next process performed pooling, which reduced the sample size of the images and normalized them using the LRN algorithm. The last two processes were two parallel feature extraction procedures (FeatEx), which formed the core of the architecture. The implementation used the Caffe library, and tests performed on the CKP dataset obtained an average accuracy of 99.6%.

2.2 Convolutional neural networks

In the literature, we find a large number of works where convolutional networks are used for image classification, showing great performance compared to other image classification techniques. CNNs are made up of different filters of more than one dimension. Defining the topology of a CNN is a complex task due to the number of hyperparameters to optimize, and for this reason different approaches have emerged for defining an optimal topology. Bergstra and Bengio (2012) compare grid search, random search, and manual search using neural networks and deep belief networks. Their comparison of these three methods over a 32-dimensional search space found statistically equal performance on four of seven data sets and superior performance on one of seven. In Snoek et al. (2012), the authors optimize a convolutional neural network on the CIFAR-10 dataset. They argue that a Bayesian treatment with a Gaussian process kernel is preferable to an approach based on optimizing the Gaussian process hyperparameters, the Gaussian process being a convenient and powerful prior distribution on functions. They also present algorithms for performing experiments in parallel and mention that, although this process has its limitations, in their experiments it beat the state of the art by 3%. In other works, reconfigurable computation patterns have been used in CNNs, as in Tu et al. (2017), where the authors compare several network architectures and shift among different computational patterns to optimize power usage. In addition, other works have tried to solve the problem of hyperparameter optimization using more traditional approaches such as Bayesian optimization, gradient-based methods, grid optimization, random optimization, and sequential search.

Fig. 1 Method to create the emotional dataset

2.3 Optimization of complex tasks

On the other hand, the literature contains a large number of search/optimization techniques. Among them are basic ones such as grid search and random search (Dinh and Van der Baan 2019). One can also find techniques such as PSO for the scheduling problem (Deng et al. 2019) or for multi-objective problems such as airport gate assignment (Deng et al. 2017a). It is also valid to combine more than one technique, as in the work where PSO is used with least squares and fuzzy information entropy for the fault diagnosis of a motor bearing (Deng et al. 2018). Evolutionary computing has also been combined with other techniques to solve complex optimization problems, as in the work of Deng et al. (2017b). Neuroevolution, the use of evolutionary computation in the family of neural networks, is not new; it has been applied successfully to several decision tasks (Miikkulainen et al. 2017; Floreano et al. 2008; Montana and Davis 1989) such as control, robotics, and artificial life. Neuroevolution is usually used to define the weights in smaller networks, or combined with gradient descent algorithms for larger networks. Miikkulainen et al. (2017) define deep evolution as a multi-level optimization, using evolutionary computation for the high-level optimization (e.g., layers, sizes, activation functions) and gradient descent algorithms for the low-level optimization (i.e., connection weights).

2.4 Affective tutoring systems

In this section, we consider only those systems oriented to a particular domain that handle affect recognition as part of their teaching methodology. An example of an affective learning environment is JavaTutor (Wiggins et al. 2017), an intelligent learning environment for teaching the Java programming language that detects affect through the user’s written dialogue. In Arevalillo-Herraez et al. (2017), an ITS for teaching algebra is presented. Throughout the learning process, the tutoring system monitors the cognitive aspects of the student, and a sensor-free affect detection module supports the cognitive part. After completing a problem, the student is asked about their affective state in terms of valence, activation, and autonomy. Thompson and McGill (2017) presented Genetics with Jean, an affective and intelligent tutor for teaching genetics topics. Its affective model takes a dimensional view of emotions, with two parts: affective activation (the degree of activation or excitement experienced by the student) and valence (how positive or negative the experience is considered). In Lin et al. (2016), the authors propose an educational system for distance instruction in digital art in which the affect or emotion of the student is recognized. Its design includes a multimodal detection system that uses the recognition of emotions through facial expressions and skin conductivity measured through galvanic responses. In Wixon et al. (2014), emotions are recognized for four affective states: confidence, agitation, frustration, and interest. The tutoring systems used in the study were MathSpring and Cognitive Tutor Algebra, both ITSs for basic mathematics education. In Wang and Lin (2018), a prototype ITS is proposed for teaching physics subjects. Two emotion recognition modules were implemented: a semantic recognizer and a facial expression recognizer that uses the EmguCV library with Haar features.

3 CNN for emotion recognition

Next, we present the steps required to implement a CNN for facial expression recognition of learning-centered emotions with application to an ITS. The CNN was optimized using genetic algorithms (GA). The steps involved in creating the system were: creating a corpus, designing and optimizing the CNN, embedding the CNN in a mobile device (Android), and integrating the CNN into an ITS.

3.1 Creating a corpus

An essential part of any recognition system is the dataset for training. Datasets contain the relevant information a recognition system needs to discriminate and classify data. We propose a method for building a facial expression dataset using the EEG-based brain–computer interface (BCI) system Emotiv Insight. This interface captures brain activity and provides information about the emotion the student is experiencing. Next, we describe the device and methods used. We searched for a method that could capture expressions within an educational context and for an activity related to the domain of the ITS that uses the facial recognition system. We fitted the EEG devices on 38 students, of whom 28 were men and 10 were women. The students wrote code in Java while the Emotiv devices captured their affective states, and an application automatically labeled the emotion of each student. Figure 1 presents the method for creating the emotional dataset. First, the student writes code for a problem; the Emotiv Insight device obtains the student’s brain activity (EEG signals) and the webcam takes a photograph of the student’s face every 5 s. Second, the system labels every student’s face with the current emotion obtained from the Emotiv device. Third, the student’s face (photograph) labeled with the emotion is saved into the emotional dataset. Fourth, we evaluated whether the emotional labels matched the facial expressions saved in the dataset; when an emotion and a facial expression did not match, we discarded both from the dataset. In the end, the emotional dataset contained 5560 labeled images. The labels used for the classification of the images were boredom, engagement, excitement, focus, relaxation, and interest. This dataset was named database Insight (dbI).
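As an illustration, a minimal sketch of this capture-and-label loop is shown below. The `read_emotiv_emotion` helper is a hypothetical stand-in for the Emotiv SDK calls (which are not shown in the paper), and the file-naming scheme is an assumption; only the 5-second interval, the webcam capture, and the six labels come from the description above.

```python
import time
import cv2  # OpenCV, the library used in this work for image handling

CAPTURE_INTERVAL = 5  # seconds between photographs, as described above
LABELS = ("boredom", "engagement", "excitement",
          "focus", "relaxation", "interest")

def read_emotiv_emotion():
    """Hypothetical wrapper around the Emotiv Insight SDK: returns the
    label of the strongest affective state currently reported by the headset."""
    raise NotImplementedError  # replace with actual Emotiv SDK calls

def capture_session(student_id, duration_s, out_dir="dbI"):
    cam = cv2.VideoCapture(0)  # webcam facing the student
    start, shot = time.time(), 0
    while time.time() - start < duration_s:
        ok, frame = cam.read()
        if not ok:
            continue
        emotion = read_emotiv_emotion()  # EEG-derived label at capture time
        # Save the photograph named with its automatically assigned label;
        # label/expression mismatches are discarded in a later review step.
        cv2.imwrite(f"{out_dir}/{student_id}_{shot:04d}_{emotion}.png", frame)
        shot += 1
        time.sleep(CAPTURE_INTERVAL)
    cam.release()
```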

3.2 Designing and optimizing the CNN

A CNN is a class of deep, feed-forward artificial neural network most commonly applied to image analysis. Image preprocessing for a CNN is less complex than for other image classification algorithms, because the convolutional part of a CNN extracts the main features of an image. CNNs have a multilayered architecture with three types of layers. The first type, the convolutional layer, extracts the main features and patterns from an image. The second type, the pooling layer, decreases the number of features produced by the previous step. Finally, a fully connected layer classifies the features and patterns obtained from the previous layers. Figure 2 shows the three layers of a CNN.

Fig. 2 Common architecture of a convolutional neural network

The input is an image, which goes through the convolutional layer. The features are then reduced in the pooling layer and finally classified in a fully connected neural network. The hyperparameters of a CNN define its topology. The hyperparameters used in this work are described as follows (a short Keras sketch using these hyperparameters appears after the list):

  • Layers: Number of layers of the CNN.

  • Filter shape: Filter size used to extract the features and patterns in a convolutional layer.

  • Stride: Number of steps that are taken while moving a filter over an image.

  • Dropout: Probability of dropping a certain number of neuron connections from the fully connected layers to prevent overfitting.

  • Batch size: Number of samples used at a given time to train the CNN.

  • Activation function: Function that computes the output of a node in the neural network given its input.

  • Optimizer: Algorithm that helps to minimize the error function with each iteration.

  • Training size: Number of images from the dataset that are used for the training process.

  • Number of neurons: Number of neurons in the fully connected layers.

  • Epochs: Number of iterations for the training of the CNN.
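To make these hyperparameters concrete, the following sketch builds a small CNN with Keras (the library used in this work) from a dictionary of the hyperparameters listed above. The specific values are placeholders, not the optimized topology reported in Sect. 4.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder hyperparameter values; the GA described in Sect. 3.2.1
# searches over ranges of these.
hp = {
    "conv_layers": 2, "filters": 32, "filter_shape": (3, 3), "stride": 1,
    "dropout": 0.5, "dense_neurons": 128, "activation": "relu",
    "optimizer": "adam", "batch_size": 32, "epochs": 20,
}

def build_cnn(hp, input_shape=(150, 150, 1), num_classes=6):
    model = keras.Sequential()
    for i in range(hp["conv_layers"]):
        kwargs = {"input_shape": input_shape} if i == 0 else {}
        # Convolutional layer: extracts features with the given filter
        # shape and stride
        model.add(layers.Conv2D(hp["filters"], hp["filter_shape"],
                                strides=hp["stride"],
                                activation=hp["activation"], **kwargs))
        # Pooling layer: reduces the number of features
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Flatten())
    # Fully connected layer: classifies the extracted features
    model.add(layers.Dense(hp["dense_neurons"], activation=hp["activation"]))
    model.add(layers.Dropout(hp["dropout"]))  # dropout to prevent overfitting
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=hp["optimizer"],
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

The batch size, training size, and number of epochs are consumed at training time, e.g., `model.fit(x_train, y_train, batch_size=hp["batch_size"], epochs=hp["epochs"])`.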

3.2.1 Optimization of the CNN

The CNN is optimized by changing its hyperparameters with the purpose of achieving higher prediction accuracy. Optimization methods applied to CNNs include random search (Bergstra and Bengio 2012) and practical Bayesian optimization (Snoek et al. 2012), among others. For this optimization process, a GA was used. The purpose of the GA is to evolve a set of individuals, called a population, to obtain the best solution to a problem. This is achieved by submitting the population to random operations resembling those of biological evolution. Each individual is rated by a score or fitness, calculated by a fitness function that is a mathematical representation of the problem. This fitness also determines the probability of being selected for reproduction, depending on the selection method. In a GA, an individual is a coded representation of a solution, usually called a chromosome, which is a set of genes; here, the genes define the CNN topology. The GA follows four main steps, explained in Algorithm 1: creating a random population, evaluating the fitness of the population, selecting individuals to evolve, and evolving the selected individuals. Evolution is performed by crossing over and mutating the selected individuals to create the next generation. This process is repeated over several generations until one of two conditions is met: the fitness of an individual surpasses a threshold, or the maximum number of generations is reached. The result is a highly fit trained CNN.

Algorithm 1
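A minimal sketch of the loop outlined in Algorithm 1 is shown below. The operator functions (`evaluate`, `tournament_select`, `crossover`, `mutate`) are assumed helpers standing in for the operators described in this section, and the fitness threshold value is illustrative.

```python
import random

GENES = 22 + 4   # topology + training genes per chromosome (Fig. 3)
POP_SIZE = 100   # population size used in the experiments (Sect. 4)
GENERATIONS = 50
P_CROSSOVER, P_MUTATION = 0.8, 0.2
FITNESS_THRESHOLD = 0.9  # illustrative stopping threshold (assumption)

def run_ga(evaluate, tournament_select, crossover, mutate):
    # Step 1: create a random population of real-coded chromosomes in [0, 1]
    population = [[random.random() for _ in range(GENES)]
                  for _ in range(POP_SIZE)]
    best_ind, best_fit = None, float("-inf")
    for gen in range(GENERATIONS):
        # Step 2: evaluate fitness (each evaluation trains a CNN; Algorithm 2)
        fitness = [evaluate(ind) for ind in population]
        i = max(range(POP_SIZE), key=lambda k: fitness[k])
        if fitness[i] > best_fit:
            best_ind, best_fit = population[i], fitness[i]
        if best_fit >= FITNESS_THRESHOLD:
            break  # stopping condition 1: fitness surpasses the threshold
        # Steps 3-4: select parents and evolve via crossover and mutation
        next_pop = []
        while len(next_pop) < POP_SIZE:
            p1 = tournament_select(population, fitness)
            p2 = tournament_select(population, fitness)
            if random.random() < P_CROSSOVER:
                p1, p2 = crossover(p1, p2)
            next_pop += [mutate(c) if random.random() < P_MUTATION else c
                         for c in (p1, p2)]
        population = next_pop[:POP_SIZE]
    # Stopping condition 2: maximum number of generations reached
    return best_ind, best_fit
```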

The GA starts with a random population where each individual is coded with floating-point values from 0 to 1 that represent the CNN hyperparameters. One gene per hyperparameter was used to code the CNN topology. The coding (normalization) was performed using Eq. 1.

$$\begin{aligned} y_i = \frac{X_i - X_{\mathrm{min}}}{X_{\mathrm{max}}-X_{\mathrm{min}}} \end{aligned}$$
(1)

where \(X_i\) represents the value to be normalized, \(X_{\mathrm{min}}\) and \(X_{\mathrm{max}}\) represent the minimum and maximum values of the hyperparameter’s range, and \(y_i\) is the result of the normalization. The evaluation of an individual starts by decoding each gene inside the chromosome, which is done by solving Eq. 1 for \(X_i\). The ceiling function was applied to the result, which was then cast according to its type. The hyperparameter types are described in Table 1.
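A sketch of this coding and decoding step, under the assumption that each hyperparameter is defined by a (min, max) range of accepted values as in Tables 3 and 4:

```python
import math

def encode(value, lo, hi):
    # Eq. 1: min-max normalization of a hyperparameter value into [0, 1]
    return (value - lo) / (hi - lo)

def decode(gene, lo, hi, discrete=True):
    # Solve Eq. 1 for X_i, then apply the ceiling function to discrete types
    x = gene * (hi - lo) + lo
    return math.ceil(x) if discrete else x

# e.g., a gene of 0.37 with an accepted range of 1..5 convolutional layers
n_layers = decode(0.37, 1, 5)  # -> 3
```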

Table 1 Type of data by hyperparameter

CNN hyperparameters were split into topology hyperparameters and training-process hyperparameters. For the optimization process, 22 genes were used for topology hyperparameters and 4 genes for training-process hyperparameters. Figure 3 shows the representation of the chromosome.

Fig. 3 Representation of the chromosome for hyperparameter selection

The fitness function used to evaluate the individuals in the GA is shown in Eq. 2.

$$\begin{aligned} \mathrm{fitness} = \frac{\mathrm{accuracy}}{\mathrm{loss}} \end{aligned}$$
(2)

In this equation, accuracy represents the proportion of correct classifications made by the CNN on the validation set, and loss represents the distance between the predicted and the correct classification. Therefore, the higher the accuracy, the better fitted the individual is for survival; likewise, the lower the classification loss, the higher the fitness of the individual. For the fitness evaluation of each individual, an algorithm was created to instantiate the CNN; it is presented in Algorithm 2. Algorithm 2 is divided into three parts: the first part decodes the chromosome, defining the layers, the epochs, and the training algorithm, among others. The second part defines the topology of the CNN according to the parameters decoded in the first part. The third part compiles and trains the network defined in the previous part. Finally, Algorithm 2 returns the accuracy of the training process.

Algorithm 2

Algorithm 2 shows the process of fitness evaluation. The upper section (lines 1–9) decodes the GA chromosome. In the middle section (sequential modeling, lines 11–34), a FOR loop builds the convolutional and dense layers, with logical conditions ensuring that the model is properly constructed. In the final part (lines 35–39), the output layer is added as a dense layer with one neuron per output class and the SoftMax function. Finally, the built network is compiled and trained, and a fitness value is calculated and returned using Eq. 2.
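A condensed Keras sketch of this fitness evaluation, reusing the hypothetical `decode` and `build_cnn` helpers from the earlier sketches (the exact layer-construction conditions of lines 11–34 are not reproduced):

```python
def evaluate(chromosome, x_train, y_train, x_val, y_val, ranges):
    # Part 1: decode each gene into its hyperparameter value (inverse of Eq. 1);
    # `ranges` is an assumed list of (name, lo, hi, discrete) tuples
    hp = {name: decode(gene, lo, hi, discrete)
          for gene, (name, lo, hi, discrete) in zip(chromosome, ranges)}
    # Part 2: define the CNN topology from the decoded hyperparameters
    model = build_cnn(hp)
    # Part 3: train the compiled network and measure validation performance
    model.fit(x_train, y_train, batch_size=hp["batch_size"],
              epochs=hp["epochs"], verbose=0)
    loss, accuracy = model.evaluate(x_val, y_val, verbose=0)
    return accuracy / loss  # fitness, Eq. 2
```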

Fig. 4 Process of recognition of emotions through an Android exported model

3.3 Adapting the CNN for mobile devices

A trained CNN is represented by two things: its topology and its weights. A CNN can be stored through a process named serialization, which converts data structures or objects into a format that can be saved in long-term storage. For portability, the CNN was serialized in the protobuf format developed by Google, using TensorFlow (Abadi et al. 2015). The following steps were performed: first, the CNN was trained on a personal computer with a graphics processing unit (GPU). After training, a checkpoint was saved using the TensorFlow library; this step creates two files containing a graph representing the CNN topology and all its metadata. These files were then serialized and transformed into a single protobuf file, which was stored and exported as a component of the ITS on the Android mobile device.
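A sketch of this checkpoint-to-protobuf step using the TensorFlow 1.x API mentioned above; the output node name is an assumption for illustration:

```python
import tensorflow as tf  # TensorFlow 1.x, as used in this work

def freeze_to_protobuf(checkpoint_path, out_pb="model.pb",
                       output_node="output/Softmax"):  # node name assumed
    with tf.Session() as sess:
        # Restore the topology (.meta) and weights from the checkpoint files
        saver = tf.train.import_meta_graph(checkpoint_path + ".meta")
        saver.restore(sess, checkpoint_path)
        # Fold variables into constants so topology + weights fit in one graph
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, [output_node])
        # Write the single protobuf file shipped with the Android app
        with tf.gfile.GFile(out_pb, "wb") as f:
            f.write(frozen.SerializeToString())
```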

Once the CNN model in protobuf format was stored on the mobile device, TensorFlow was used to deserialize it and create an instance of the previously trained CNN inside the ITS. This instance was used to classify the user’s facial expressions into educational emotions (bored, frustrated, engaged, excited, relaxed, and focused). The process was as follows: an image was obtained from the front camera of the mobile device, and the user’s face was detected using OpenCV (Bradski 2000). The face image was reshaped to 150x150x1, the same shape used to train the CNN, and transformed into a Bitmap. The Bitmap was converted into a vector, which was passed through a Java native interface (JNI) call to the TensorFlow CNN. The CNN classified the data and returned an array containing the six emotions, each represented by a value denoting the probability of being recognized. Figure 4 illustrates this process.
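The mobile pipeline itself is written in Java, but an equivalent desktop-side sketch in Python shows the same preprocessing and classification steps; the Haar cascade detector choice and the tensor names are assumptions for illustration:

```python
import cv2
import numpy as np

# Standard OpenCV frontal-face Haar cascade (detector choice is an assumption)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

EMOTIONS = ["bored", "frustrated", "engaged", "excited", "relaxed", "focused"]

def classify_frame(frame, sess, input_name="input:0",
                   output_name="output/Softmax:0"):
    """Detect a face, reshape it to 150x150x1, and run the deserialized CNN
    (a tf.Session holding the frozen graph); tensor names are assumed."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    if len(faces) == 0:
        return None  # no face detected in this frame
    x, y, w, h = faces[0]
    face = cv2.resize(gray[y:y+h, x:x+w], (150, 150))  # training input shape
    batch = face.reshape(1, 150, 150, 1).astype(np.float32)
    probs = sess.run(output_name, {input_name: batch})[0]
    return dict(zip(EMOTIONS, probs))  # probability per emotion
```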

3.4 Integration of a CNN into an ITS

We have developed an affective ITS named Multi-Sensei that helps elementary school students learn and improve their multiplication and division skills. Multi-Sensei adapts the exercises presented to each student depending on cognitive and affective values: the number of aids, the time spent on an exercise, the number of errors, and the current emotion. Depending on the fuzzy values of each cognitive and affective variable, a fuzzy logic system determines the complexity of the next exercise. Figure 5 illustrates the evaluation process of Multi-Sensei; for clarity, an example of a fuzzy rule is also shown as part of Fig. 5.

Fig. 5 The affective and cognitive values of the fuzzy logic system in ITS Multi-Sensei
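Figure 5 includes an example fuzzy rule. A hypothetical rule of the same shape is sketched below; the linguistic terms, membership breakpoints, and consequent are illustrative assumptions, not the actual rule base of Multi-Sensei.

```python
def rule_strength(errors, time_spent_s, engaged_prob):
    """Hypothetical Mamdani-style rule: IF errors IS few AND answer IS fast
    AND emotion IS engaged THEN next-exercise complexity IS high.
    Returns the rule's firing strength in [0, 1]."""
    def tri(x, a, b, c):
        # Triangular membership function with its peak at b
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    few_errors = tri(errors, -1, 0, 3)            # illustrative breakpoints
    fast_answer = tri(time_spent_s, -1, 10, 60)   # seconds, illustrative
    return min(few_errors, fast_answer, engaged_prob)  # fuzzy AND as minimum
```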

4 Tests and discussion

The tests consisted of measuring the accuracy of the emotion recognizer whose hyperparameters were tuned with the GA. The recognizer was compared with four different classification algorithms: a support vector machine (SVM), an artificial neural network (ANN), a K-nearest neighbors (KNN) classifier, and the CNN for learning-centered emotion classification presented in a previous work (González-Hernández et al. 2018). The classifiers used three different feature extraction techniques: local binary patterns (LBP), geometry-based feature extraction, and convolutional filters (CF). As mentioned before, these tests used the dbI dataset, the corpus created with the Emotiv Insight. The class distribution of the dataset (number of images for each emotion) is shown in Table 2.

Table 2 Class distribution for the dataset built with Emotiv Insight

The dbI has six classes of spontaneous learning-centered emotions that were used to train the CNN. The hyperparameter tuning was performed with a GA with a population of 100 individuals, 24 genes per individual, a total of 50 generations, a crossover probability of 80%, a mutation rate of 20%, and tournament selection. The topology and training hyperparameters, with the value ranges accepted by the GA, are shown in Tables 3 and 4.

Table 3 Topology hyperparameters
Table 4 Training hyperparameters

Using the GA, the best-fit individual obtained an 82% accuracy rate, compared to 74% on the same dataset in the previous work (González-Hernández et al. 2018). This represents an improvement of 8 percentage points obtained solely by hyperparameter tuning with the GA when compared with the previous CNN, and a substantial improvement when compared with the other classification methods and feature extractors. Table 5 shows the comparison of the different techniques for learning-centered emotion classification developed in the previous work (González-Hernández et al. 2018). The first four rows correspond to the methods previously tested, and the fifth row corresponds to the result of this work.

Table 5 Method comparison for learning-centered emotion classification

This improvement shows that favorable results can be obtained even with an unbalanced database. We believe the results should improve further once the dbI database undergoes a filtering process and class-distribution balancing.

For the tests and experiments, we used Python as the programming language and Keras for model creation, with TensorFlow 1.10 as the back end, on a server with an Intel i7-6700HQ CPU (8 threads at up to 3.5 GHz), an Nvidia GTX 1060 graphics card with 6 GB of dedicated video memory and 1280 CUDA cores, and 16 GB of system RAM.

5 Conclusions

This work presents the use of a GA for hyperparameter tuning in a CNN for the recognition of learning-centered emotions. The results suggest that using a GA for automatic hyperparameter definition of a CNN improves accuracy when compared with a CNN tuned by trial and error and with other machine learning classifiers. We still need to perform more tests to compare our CNN with other classifiers based on deep learning (e.g., long short-term memory networks). Due to the current importance of mobile learning, we implemented our CNN on Android mobile devices. This will allow us to support different intelligent learning environments for mobile phones, such as Multi-Sensei. The CNN can effectively support an intelligent tutoring system in personalizing its decisions based not only on cognitive aspects but also on affective aspects of the student. We are currently testing the ITS Multi-Sensei using the emotion recognizer with the optimized CNN. Image processing of the user’s face was performed using the OpenCV library (Bradski 2000), the CNN was implemented with the TensorFlow library (Abadi et al. 2015), the GA was written in Python, and Multi-Sensei was coded in Java.

As future work, we will balance our dbI dataset by increasing the classes focused, relaxed, and interested, which should produce better recognition rates in our CNN. We also want to test our optimized CNN in sentiment analysis and opinion mining. In addition, we will test the ITS Multi-Sensei with students from local schools; in this case, we will check the effect of recognizing and handling emotions while the student learns the multiplication and division process.