1 Introduction

This work addresses, on the one hand, applications of real-time multimedia content analysis and, on the other, the analysis of the human user. We present a real-time visual attention-scoring algorithm and a real-time visual engagement-scoring algorithm. Multimedia is an essential element of e-learning, but analysis of multimedia content for its visual engagement has received little prior attention [1]. Visual engagement refers to the visual attractiveness of a video lecture for a student; this measure helps quantify the effectiveness of the video lecture. The human users in this work are the student and the teacher, and both can benefit from this analysis. The student’s visual attention is measured with the attention-scoring algorithm. Together, the two measures help the student improve the learning experience and the teacher improve teaching performance [2].

We demonstrate the analysis of video lectures by reading a video stream from the video file and extracting frames one by one. Each frame is converted to grayscale, enhanced through image processing, and then subjected to face detection followed by eye detection. The same process runs on the student side: a video stream is captured from the webcam mounted on the student’s laptop or computer [3], and each frame is enhanced before the detections are performed. This real-time recognition and tracking [4] of faces and eyes in the two video streams produces two datasets.
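A minimal sketch of this per-frame pipeline, using OpenCV’s Haar cascade classifiers in Python; the specific cascade files and detection parameters are assumptions, not the paper’s stated configuration:

```python
import cv2

# Haar cascades shipped with OpenCV; the exact classifiers used in the
# paper are an assumption -- any frontal-face and eye cascade would fit.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_faces_and_eyes(frame):
    """Grayscale -> enhancement -> face detection -> eye detection."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)              # contrast enhancement
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)
    detections = []
    for (x, y, w, h) in faces:
        roi = gray[y:y + h, x:x + w]           # search for eyes inside the face
        eyes = eye_cascade.detectMultiScale(roi)
        detections.append(((x, y, w, h), eyes))
    return detections

# Student side: webcam stream (device 0); the teacher side differs only
# in opening the lecture video file instead.
cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = detect_faces_and_eyes(frame)
    # ...scoring and dataset logging happen here...
cap.release()
```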

Student–teacher interaction using real-time multimedia analytics remains largely understudied, and the aesthetic qualities of interaction design have likewise received little research attention. Interaction design is the feature distinctive to the computer side when the user is human. In e-learning, the two human users do not interact directly; a computer mediates between them [5]. Video lecture recording starts with the teacher interacting with a video camera setup [6] while the student is absent. Recorded video lectures are then delivered to the student, and when the student watches a lecture, no teacher is present. The student has to interact with a computer, so both human users interact through computers. The elimination of face-to-face interaction and its replacement with human–computer interaction around multimedia content create an opportunity to analyze both situations with real-time analysis. Video services are becoming more pervasive due to the swift development of next-generation communication and network systems. With the growth of these services, the number of users is increasing at an exponential rate, and a widening range of devices is in use, e.g., wearable equipment, smartphones, tablets, personal computers or laptops, and smart televisions. Real-time object tracking is a challenging area of research, and an increasing number of good results are reported with respect to robustness and accuracy; however, clear statements about the tracking situation are often missing when these methods are applied to real-world problems [4]. The human visual system (HVS) cannot give the same attention to a whole scene simultaneously, especially when watching a video on a display device; hence, the overall perceptual quality experienced by the HVS cannot be fully determined from the subjective point of view alone.

The objective of this study is to explore student-focused imaging applications for tracking and recognizing the student and the teacher in e-learning [2]. Perceptually inspired imaging methods have been widely used in many areas of visual information processing and multimedia tools and applications [7]. Since a multimedia system produces videos and other types of data, e.g., sounds and images, that are eventually perceived by a human, it is important to understand how the HVS uses this information [8]. The HVS is only partially understood due to its complex nature, which exhibits many nonlinearities [9]. Hence, it is an opportunity that can be exploited in multimedia systems with computational science and aesthetics [10]. The HVS establishes communication between the eyes and the brain to respond to external stimuli in the form of multimedia content [11]. Human vision is one of the senses we depend on most in daily life; it works very fast to detect objects and recognize the scene in front of the eyes [12]. Watching videos for learning purposes requires good visual content in the video lectures so that students can extract useful information [13]. The current study measures video lectures for visual engagement regardless of the teaching material or content covered in the lectures. We call the frame-by-frame analysis of a video lecture visual engagement analysis; it produces a dataset comprising a visual engagement score and an appropriate class label. In the same fashion, the student’s attention is measured using a webcam or video camera setup [6, 14]; we call the measurement of the student’s attention with a webcam visual attention. Our vision is a fine-tuned process [13], and the brain interprets the images captured in the process of vision. Detecting the presence of a face and eyes, and checking whether the eyes are open or closed, allows visual attention to be measured and a dataset to be prepared for further analysis, including the application of machine-learning predictors to model the student learning experience [15]. The two datasets, for visual engagement and visual attention, are merged by transforming them into time-based data.
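The paper does not detail this merge step; the sketch below assumes both datasets carry per-frame timestamps and aligns them on the nearest timestamp with pandas. All column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical per-frame datasets; column names are illustrative only.
engagement = pd.DataFrame({
    "time_s": [0.0, 0.5, 1.0],
    "engagement_score": [0.8, 0.5, 1.0],
})
attention = pd.DataFrame({
    "time_s": [0.1, 0.6, 0.9],
    "attention_score": [1.0, 0.5, 0.0],
})

# Align both streams on the nearest timestamp within half a second.
merged = pd.merge_asof(
    attention.sort_values("time_s"),
    engagement.sort_values("time_s"),
    on="time_s", direction="nearest", tolerance=0.5)
print(merged)
```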

Student–teacher interaction is a critical part of the learning process at both ends, and effective lecture delivery is essential for students to learn. In the case of recorded video lectures, the visual attractiveness of the video content makes students more attentive and engaged with the lectures [16]. e-learning refers to learning in which the student proceeds at his or her own pace, with no teacher present to monitor progress [17, 18]. Automated tools can support such scenarios, and the visual engagement of video lectures can also be measured with such tools [19]. During this research, a tool, “Video Lecture Analysis,” was developed. A video of standard quality usually has 16–20 frames per second, and this can be higher if the video is high definition (HD). The tool analyzes a video lecture or video presentation and produces a dataset. The video is processed frame by frame, and each frame is marked with the visual engagement of the teacher, e.g., presentation, partially engaging, partial eye contact with the camera, or fully engaging [6, 20]. Interactive video constructs are then developed from these time-based video constructs. This article presents a qualitative assessment methodology for an e-learning environment based on the analysis of student attentiveness and teacher performance measurements. A model is presented to demonstrate the association between attentiveness and engagement for students and teachers, respectively. The article comprises an introduction describing the background of the problem in learning and teaching, the proposed methodology to collect, analyze, and extract information from the student and teacher ends, results and discussion, a literature review, and the conclusion of the study.

This paper is composed of five sections and organized as follows: Sect. 1 is the introduction, which illustrates the importance of the study and the relationship between the real-time imaging model and e-learning. Section 2 is the literature review, divided into two subsections: Sect. 2.1 covers the human visual system and applied visual perception, and Sect. 2.2 covers real-time computer vision and human–computer interaction. Section 3 presents the proposed methodology, i.e., the real-time imaging-based assessment model, and is further divided into two subsections: Sect. 3.1 covers tracking the student’s visual attention and presents the real-time visual attention-scoring algorithm, while Sect. 3.2 covers tracking the teacher’s visual engagement and presents the real-time visual engagement-scoring algorithm. Section 4 gives the results and discussion. Section 5 concludes and outlines future work.

2 Literature review

The goal of this section is to survey the real-time models [4], algorithms, and technologies needed to enable multimedia systems. State-of-the-art research on real-time imaging systems with computational analysis capable of invoking human emotional responses forms part of this section [21]. The real-time aspects of video processing systems and technologies, including experimental prototypes, theoretical studies, emerging trends, and applications, are also reviewed [22, 23]. Content-based information retrieval from video is a common theme in the literature [24]. A set of rules is applied to search the video content by extracting the image sequence [23]; at search time, each image denotes a search state [25]. A comparison of different models is shown in Table 1.

Table 1 Feature set comparison used in different models for interaction assessment

2.1 Human visual system and applied visual perception

The ability to process visual detail is due to the human visual system, which is part of the human central nervous system. A stimulus is received and interpreted to extract information from the video [35]. The visual system performs a number of complex chores, including the identification and categorization of visual objects [36]. Computer vision experts apply the HVS model to video and image processing to account for psychological as well as biological processes, gaining insight into how humans handle visuals and enabling better design of intelligent interfaces [37]. Such a model is used to simplify the behavior of what is a very complex system [38]. The HVS gives humans the ability to process visual detail [36, 39]. After detection, it interprets the information using biological phenomena to build a representation of the environment [40]. The student receives the visual stimulus from the video lecture [41]. The presence of the teacher’s face encourages the student and increases visual attention toward the lecture; likewise, if each image of the video is aesthetic and has good contrast, visual attention will be high [19]. A repetitive visual stimulus is a stimulus with a characteristic pattern of repetition [39]; several stimuli are presented to the user simultaneously while the user focuses attention on the corresponding stimulus [39]. Typically, an entire video stream is divided into multilayered substreams to accommodate the diversity of devices and satisfy the various demands of users [42]. These substreams are then distributed and transmitted in parallel through multiple subchannels to diverse end-users [43]. To enhance the recognition performance of the system, an innovative technique is presented in [44]. Classification of segmented images is used to improve search speed [45]. An efficient scheduling arrangement combined with a reinforcement algorithm significantly reduces processing time [46]. Massive volumes of video can be processed using cloud computing technologies [47], and a cloud computing-based solution for educational institutes is proposed in [48, 49]. Big data processing and analysis can also be useful for this purpose. Big data processing runs into problems when a variety of data sources is encountered; this can be resolved using transformation techniques, and big data transformation algorithms are presented in [50]. Big data transformation solves the heterogeneity problem, and the selection of accurate and robust features is the central component of effective content-based image retrieval using object shape information [51].

2.2 Real-time computer vision and human–computer interaction

The real-time aspect is critical for the analysis of video lecture files and for the visual inspection of students. The literature shows that an image processing algorithm is often developed and proven theoretically sound, while its practical applications and the detailed steps, methodology, and trade-off analysis required to achieve real-time performance are never fully explored [52]. This happens because the algorithm is presumably designed with a specific application in mind. The current study is thus intended to bridge the gap between the theory and practice of image and video processing in designing, implementing, and utilizing real-time image and video processing systems [23].

Considerable work over the last decade has utilized numerous kinds of visual features, as well as combinations of these features [38, 42, 44, 45, 51, 53]. Image segmentation based on skin color has been used by many researchers, and many color spaces have been introduced in the literature [45]. Popular color spaces include YUV, YCrCb, HSV, and RGB. Color spaces that cleanly separate the luminance component of a color from its chromaticity are generally well understood by researchers [45]. Contour extraction based on edge detection is also common; it yields a large number of edges that belong not only to the objects being detected but also to irrelevant background objects [44]. Significant work has been carried out on finding objects in grayscale images based on their texture and appearance. Motion is an indicator utilized by a few detection approaches; since these assume that the only motion in the image is due to the object’s movement, motion-based detection demands a very controlled setup. A detection method can be used for tracking as well if it is fast enough to operate at the image acquisition frame rate [54]. Methods for tracking and detection are very similar to each other [44]; tracking methods sharply limit the image search region in order to keep track of the object that was detected [55]. The implicit assumption for such a method to succeed is the frequent acquisition of images [53]. Feature tracking is another approach found in the computer vision literature [21]. This type of interaction enables the users of e-learning to interact with data in a more natural way [56, 57].

A number of software platforms and libraries are available to programmers to build systems capable of real-time analysis and scoring for video and image processing [52]. The list includes MATLAB [58], OpenCV [59], the AForge.NET Framework (www.aforgenet.com/framework), and iGesture (www.igesture.org). OpenCV is widely used to develop software applications for computer vision and human–computer interaction [60]. C++ is its natively supported language [61], while support for other programming languages is also available, e.g., Java, C# (EmguCV [62] is a wrapper library of OpenCV), and Python. In the next section, algorithms are presented for video processing to calculate visual engagement and visual attention. Face detection and tracking algorithms exist, but no intelligent algorithms measure and analyze teaching performance and student experience. Likewise, eye movement datasets are available, but no real-time measurement and analysis algorithms are present in the literature. The principal goal of this methodology is to enhance student learning and teacher performance by improving the student’s visual attention and learning experience and by enhancing the teacher’s visual engagement and teaching performance. In this study, we have developed a real-time imaging-based assessment model. The model tracks the student’s visual attention using our real-time visual attention-scoring algorithm and, in the same way, tracks the teacher’s visual engagement using our real-time visual engagement-scoring algorithm.

3 Real-time imaging-based assessment model

Visual reception takes place at the retina, which captures the image, its color, and shading. The image is transferred in the form of neural impulses through the optic nerve to the rest of the brain for processing, where the visual cortex interprets it to extract context, semantics, and form [63]. This study aims to assess the visual engagement of video lectures; the analysis can be useful for the presenter and the student to determine the overall visual engagement of the videos [44]. For this purpose, a new algorithm and a data collection module are developed. Videos can be transformed into a dataset with the help of the data collection module. The dataset is prepared by extracting the image frames from the video and marking each with the number of faces [38], the number of eyes, the status of the eyes, and the engagement score along with a nominal engagement-level label, as shown in Fig. 1. In the case of the student, the video stream is taken from the webcam; in the case of the teacher, it is taken from the video file. Frame-by-frame analysis is performed in real time, and the process continues until the last frame of the video stream has been processed. The score of each frame is appended to the dataset.

Fig. 1 Real-time video processing model
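As a concrete illustration of the dataset produced by the data collection module, each processed frame could be logged as one row; the field names and example values below are hypothetical, not the paper’s exact schema.

```python
import csv

# Hypothetical schema for the per-frame dataset described above.
FIELDS = ["frame", "time_s", "faces", "eyes", "eyes_open",
          "engagement_score", "engagement_level"]

with open("video_lecture_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    # Example row: one face, both eyes detected and open.
    writer.writerow({"frame": 1, "time_s": 0.04, "faces": 1, "eyes": 2,
                     "eyes_open": 2, "engagement_score": 1.0,
                     "engagement_level": "fully engaging"})
```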

Histogram equalization stretches out the distribution of grayscale values. The technique improves the contrast of an image by spreading out its intensity range, which makes detection easier for the computer vision algorithms [64].
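In OpenCV this enhancement is a single call; a minimal sketch, with an illustrative input path:

```python
import cv2

# Contrast enhancement as described above; the input path is illustrative.
gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
equalized = cv2.equalizeHist(gray)   # spreads intensities over [0, 255]
cv2.imwrite("frame_equalized.png", equalized)
```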

3.1 Tracking student’s visual attention

The first step in visual attention scoring is the detection of faces and eyes and their segmentation from the image. The segmentation step is central, since it separates the portion of the image used for scoring from the background so that it can be passed to the subsequent recognition and tracking phases, as shown in Algorithm 1. The algorithm processes the video in real time. Frames are extracted from the video one by one; each is converted to grayscale and enhanced for correct detection of the face and eyes. The detection of faces and eyes is performed using a multiscale cascade classifier. After detection, scoring begins. If no face is detected, zero is inserted for all attributes; when faces are detected, the score depends on the number of faces found and is calculated using Eq. (1). The score for detected eyes is calculated in the same way. The number of faces and eyes depends on the number of persons in the video. The face-to-eye ratio helps determine the size of the eye in the image, which in turn helps determine whether the eye is open or closed by computing Hough circles: if no circle is found, the eye is closed; if any circles are found, the eye is open. The visual attention score is calculated using Eq. (5). The constant 0.5 reflects the eye-to-face ratio, i.e., one face has two eyes, while the constant 2 normalizes a single detected face with respect to the eyes detected. The level of attention is determined by the calculated score. The steps of Algorithm 1 are self-explanatory.

Algorithm 1 Real-time visual attention-scoring algorithm
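A minimal sketch of this scoring loop follows. Eqs. (1) and (5) are defined elsewhere in the paper, so the score computation below is only an illustrative stand-in built from the 0.5 and 2 constants discussed above; the cascade files and Hough parameters are likewise assumptions.

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def eye_is_open(eye_gray):
    """Hough-circle test: no circle found means the eye is closed."""
    blurred = cv2.medianBlur(eye_gray, 5)
    circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, dp=2,
                               minDist=max(1, eye_gray.shape[0] // 2))
    return circles is not None

def attention_score(gray):
    """Illustrative stand-in for Eqs. (1) and (5) -- not the paper's formulas."""
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)
    if len(faces) == 0:
        return 0.0                       # no face: zero for all attributes
    open_eyes = 0
    for (x, y, w, h) in faces:
        roi = gray[y:y + h, x:x + w]
        for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(roi):
            if eye_is_open(roi[ey:ey + eh, ex:ex + ew]):
                open_eyes += 1
    # 0.5 per open eye (two eyes per face), normalized by the face count.
    return min(1.0, 0.5 * open_eyes / len(faces))
```

Calling attention_score on each enhanced grayscale frame from the webcam loop of Sect. 1 yields the per-frame score that is appended to the dataset.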

Teachers can use the data to observe students’ activities and to extract patterns that recognize and categorize problems on an individual basis as well as for the class as a whole. Although the results extracted from the data usually pose severe problems, they can be portrayed in a proper and suitable manner if student–teacher comparison-based studies are conducted. Learning management systems usually display such information using very basic charts or in text format [1]. The massive amount of data can be organized to structure its semantics, so that inference can be performed to extract learning patterns and measure teaching performance. Moreover, these monitoring features generally offer very restricted interactivity, a limitation that the proposed methodology overcomes. Recorded video lectures and students attending those lectures remotely provide video files and video streams, respectively [65]. Both video sources can be analyzed to extract patterns, to verify the effectiveness of the video lecture, and to identify attentiveness problems faced by the students.

3.2 Tracking teacher’s visual engagement

Video lectures have abundant instructive potential to improve the learning and teaching situation if adopted in a proper way. There is therefore a need to integrate technology and create novel prospects for learning and teaching. Video lecture development involves detailed implementation scheduling, and understanding the philosophical perspective of pedagogy is a criterion for effective learning content. According to [66], constant learning through video lectures has the benefit of directing knowledge to the students. Preparing video lectures and extracting additional information from them also promotes the medium’s usability, as shown in Algorithm 2.

Algorithm 2 Real-time visual engagement-scoring algorithm

Frame-by-frame analysis is performed on the frames extracted from the video stream. Each extracted frame is converted to grayscale, and histogram equalization is applied to correct the contrast and brightness of the image [64]. The enhanced image is used to detect the face in the first stage, and eye detection is performed in the second. Histogram equalization is an essential part of the image correction because of the dynamic variations in lighting and environment; it also reduces the chances of misdetections and false detections. The actual images used during the method are shown in Fig. 2. A frame extracted from the video stream of a video lecture is shown in the upper left. The frame is converted to grayscale, and histogram equalization is applied to improve the contrast and brightness. Once the face is found, the face portion is segmented, converted to grayscale, and histogram equalization is applied again. The eye is then located within the face image and cropped, and the Hough circles found are drawn on a blank image, indicating that the eye is open. This process is repeated for each frame extracted from the video stream.
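A minimal sketch of this teacher-side loop, reusing the attention_score helper from the sketch in Sect. 3.1. The mapping from the frame score to the engagement labels of Sect. 1 is an assumption; Algorithm 2 defines the actual cut points.

```python
import cv2

# attention_score is defined in the sketch accompanying Algorithm 1.

# Hypothetical thresholds mapping a frame score to the engagement labels
# mentioned earlier; the paper's actual cut points may differ.
def engagement_label(score):
    if score >= 1.0:
        return "fully engaging"
    if score >= 0.5:
        return "partial eye contact"
    if score > 0.0:
        return "partially engaging"
    return "presentation"                # no face: slides or board only

cap = cv2.VideoCapture("lecture.mp4")    # teacher side: a video file
rows = []
frame_no = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break                            # last frame processed
    gray = cv2.equalizeHist(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    score = attention_score(gray)
    rows.append((frame_no, score, engagement_label(score)))
    frame_no += 1
cap.release()
```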

Fig. 2 Real-time image processing using computer vision approach along with processed frame results

During the study, seven video lectures of various lengths were analyzed. Different frame rates were achieved depending on the number of detections, i.e., the number of faces or eyes [54]. A summary of the processed videos, the total number of frames processed, the minimum, maximum, and average number of frames processed per second, and the number of detections is given in Table 2.

Table 2 Statistical summary for detections of faces and eyes with performance results

4 Results and discussion

We evaluated the technique on a laptop: an HP ProBook 4540s with a third-generation Intel Core i5 (3210M, 2.50 GHz) processor, an HP HD webcam running at a resolution of 1366 × 768, and 5 GB of RAM. Microsoft Visual Studio 2013 with OpenCV 2.4.10 and EmguCV (Windows universal CUDA 2.4.10) was used.

Fig. 3 Processed frames per second for each video lecture as per the frame rate

Learning analytics (LA) for the e-learning paradigm has many advantages for students as well as teachers. The process follows these steps: the data are captured, information is extracted, a prediction model is applied to the data, action is taken on the recommendations of the expert system, and the whole process is then refined. Further implications of this type of study can be found in the Signals project [1]. Video lecture files are processed using the real-time imaging-based assessment model [52], and the results are shown in Table 3. The results are extracted using Algorithm 1, as shown in Fig. 3.

Table 3 Summary of processed video files using Algorithm 1

For the student, the aesthetics of a video lecture lie in the presence of the teacher’s face. Moreover, eye contact between the student and the teacher in the video plays an important role in better interaction. The object of aesthetic evaluation could first be the program itself, which is available to the student for interaction, along with its data model and algorithm. Our model and technique aim to enhance teaching performance and improve the students’ quality of experience through the design of the software interface and the human–computer interaction. The processed frames per second, together with the average number of frames per second, a trend line, and error bars, are shown in Fig. 4. The summary of performance results for each processed video file shows that a minimum of 5 frames per second is processed. The thick black dotted line shows the trend of the mean processed frames per second (PFPS), i.e., 5 for each video, and the upper dotted line shows the mean of the maximum processed frames per second, i.e., 14.

Fig. 4 Frames processed per second with max, mean, and error bars. Starting from the left is video file 1, and the rightmost is video file 7

Real-time imaging is useful for the assessment and improvement of teaching performance and student experience in e-learning. The next section is the conclusion and future work of the study.

5 Conclusion and future work

The conclusions derived in this research article may foster wider acceptance and sustained use of multimedia technologies among students and teachers. It is expected that the developments made during this study will motivate future research efforts; cross-sectional research designs are also needed to extend the validity of the findings. This type of real-time multimedia analysis creates new opportunities to rethink interaction design and instructional design. In an interactive system, the video constructs are designated within the computer by a set of rules [53]. By recording and analyzing the interactions in real time, the rules are applied as the system learns, and the system can adapt its behavior in light of its experience with students interacting with the video lectures. A video presentation focused on a specific topic can be more helpful for students [23] and can be used in class preparation for deeper exposure to the subject. Measuring visual engagement before the lecture is attended, in formal or informal teaching, is even more useful: this method lets the student prepare ahead of time, improves the student’s tendency toward self-learning, gives learners the opportunity to keep up their own learning pace, and allows the learning material to be reused. Consolidating diverse information sources to make claims about student learning is well established in assessment, yet loaded with challenges.

This research has the following future directions. First, the analysis was performed on selected video lectures, and only qualitative analysis was carried out. Second, data from a selected number of students were analyzed qualitatively. Future research may collect more data, perform quantitative analysis, and investigate the relationships between the teacher’s engagement and the student’s attentiveness. The relationships among the possible states of students’ visual attentiveness and teachers’ visual engagement should be investigated using quantitative research methodology.