
1 Introduction

With the rapid development of online video platforms and intelligent educational systems such as Coursera and Khan Academy [12], an enormous number of students and knowledge seekers browse educational videos to consolidate their understanding of courses and broaden their horizons. Knowledge concept prediction for educational videos is therefore a fundamental and promising task for organizing and managing this large and diverse collection of videos.

Figure 1 shows an example of a math video and its related knowledge concepts, together with part of the knowledge structure. The video consists of multiple frames and a series of closed captions, and can be split into three sections: introduction, problem solving, and conclusion. In the last section, the lecturer refers back to the problem and reviews the problem-solving process, which illustrates a common characteristic of educational videos: they are composed of sections (such as introduction, concept explanation, analysis, and conclusion), and this highlights the importance of considering the context of sections. As a key element of education, knowledge concepts are usually organized as a tree or a Directed Acyclic Graph (DAG). As shown in Fig. 1, if we take the root node as level 0, sub-concepts are separated into two different routes from level 2, which characterizes a Hierarchical Multi-label Classification (HMC) problem. This type of problem has drawn increasing attention in industry and education with the trend toward disciplinary crossover.

Fig. 1. An example of a math video from Khan Academy and its related hierarchical knowledge concepts.

In the literature, prior works on video classification [3, 4, 22] have achieved great success. Most of these works focus on relatively short video clips and recognize human actions and objects, while long-term video understanding has not been explored much yet. For long videos, prior studies [30] evenly or randomly sample certain frames, or detect shot boundaries [31] to break whole videos into sections. Philip et al. [12] studied different types of educational videos and how video production decisions affect student engagement. Typical styles of educational videos include classroom lectures, slide presentations, "talking head" shots of an instructor, and digital tablet drawings. The long-term content and more complex compositional structure of educational videos make the above strategies ineffective. In addition, most recent HMC works [14] combine local and global approaches and utilize hierarchical dependencies in the form of a feed-forward network. However, these studies fail to model explicit inter-level hierarchical constraints and are currently limited to textual content.

In summary, we face the following challenges: (1) how to make use of multi-modal information from frames and subtitles; (2) how to capture finer-grained characteristics of relatively long educational videos, such as section-level contexts; (3) how to effectively split educational videos into sections; and (4) how to explicitly model inter-level constraints in the hierarchical knowledge structure.

To tackle the above challenges, we propose a novel framework named Spotlight Flow Network (SFNet). Specifically, we adopt a text-to-visual uniform segmentation strategy that exploits the progressiveness of content within a section and the uniformity provided by the timecodes of closed captions. Then, we model the mechanism by which viewers' attention (the "spotlight") follows the lecturer, leveraging the different kinds of information produced by the preprocessing step. We also utilize explicit inter-level constraints of the hierarchical knowledge structure and the associations between sections and concepts to improve the performance of knowledge concept prediction. A real-world dataset of 7,521 educational videos is constructed, and extensive experimental results demonstrate the effectiveness of our proposed method.

2 Related Work

Long-Term Video Understanding. In the literature, there have been many efforts to understand video content [2, 5, 6, 13], including 2D and 3D CNN networks [9, 26, 32], two-stream methods [22], and, in recent years, well-known transformer-based methods [3, 4]. Most prior works focus on relatively short video clips (normally within 30 s) and recognize human actions, objects, scenes, etc., while long-term video understanding has not been explored much yet. Donahue et al. [8] proposed an end-to-end recurrent convolutional network for learning long-term dependencies. Wu et al. [31] proposed an object-centric transformer framework that recognizes, tracks, and represents objects and actions in long videos. In summary, most existing studies randomly or uniformly sample certain frames from videos [17, 27] or detect shot boundaries [31] to break whole videos into sections, yet they do not transfer well to educational videos due to the diversity and complexity of the content.

Multimodal Video Representation. Aside from visual frames, videos also contain multimodal information such as audio and caption texts, which carry complementary semantics and can enhance representation [15, 23]. Shang et al. [19] utilized timestamps of closed captions to incorporate multimodal signals with a short-term order-sensitive attention mechanism. Gabeur et al. [11] developed a transformer-based architecture that jointly encodes the appearance of different modalities by exploiting cross-modal cues. Nagrani et al. [16] added multimodal bottlenecks to the input of a transformer encoder and limited the exchange of multimodal data in the middle self-attention layers, obtaining more effective representations. For educational videos, VENet proposed by Wang et al. [28] exploited their static and incremental characteristics and modeled the fixed human reading order, yet, like other studies, it is inadequate for fusing intra-section multimodal information at a fine-grained level, which is a central concern of our framework.

Hierarchical Multi-label Classification. There have been many efforts on HMC in the literature [1, 10]. Flat-based methods ignore the hierarchical structure and only leverage the last level. Local approaches adopt a classifier for each level of the hierarchy, while global methods predict all classes with a single classifier. Recently, many hybrid methods that combine the local and global manners have been proposed. Sun et al. [24] transformed the label prediction problem into optimal path prediction with structured sparsity penalties. Shimura et al. [21] addressed the data sparsity problem, in which data at lower levels are much sparser than at upper levels, and developed HFT-CNN to mitigate it. Wehrmann et al. [29] proposed a hybrid method called HMCN that penalizes hierarchical violations. Huang et al. [14] proposed HARNN, an attention-based recurrent network that models the correlation between texts and the hierarchy. Recently, Shen et al. [20] presented TaxoClass, which utilizes the core-class mechanism of humans. However, most prior studies are limited to texts and are inadequate for capturing the inter-level constraints of the hierarchical structure.

3 Preliminaries

3.1 Problem Definition

The input of our task is an educational video \(V=\{F,C\}\) composed of multiple frames \(F=\{f_1,f_2,...,f_n\}\) and closed captions \(C=\{c_1,c_2,...,c_m\}\), where each frame \(f_i\) is an RGB image of width W and height H, and each caption consists of text and start and end timecodes, i.e., \(c_j=\{t_j, tc^{start}_j, tc^{end}_j\}\). The text of a caption can be described as a word sequence \(t=\{w_1, w_2, ..., w_k\}\). The hierarchical knowledge structure is denoted as \(\gamma =(K_1, K_2, ..., K_H)\), where H represents the depth of the hierarchy and \(K_i=\{k_1, k_2,...\}\) is the set of knowledge concepts at level i. The predicted concepts are \(L=\{l_1, l_2, ..., l_H\}\), where \(l_i \subseteq K_i\) for all \(i \in \{1, 2, ..., H\}\). Given an educational video V and the hierarchical knowledge structure \(\gamma \), our goal is to predict the knowledge concepts L for the video.
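To make the notation concrete, the following is a minimal Python sketch of the input structures; the class and field names (Caption, EducationalVideo, tc_start, etc.) and the example concepts are illustrative only and not part of the paper.

```python
from dataclasses import dataclass
from typing import List, Set

import numpy as np


@dataclass
class Caption:
    """One closed caption c_j = {t_j, tc_start_j, tc_end_j}."""
    words: List[str]      # word sequence t = {w_1, ..., w_k}
    tc_start: float       # start timecode in seconds
    tc_end: float         # end timecode in seconds


@dataclass
class EducationalVideo:
    """Input video V = {F, C}."""
    frames: List[np.ndarray]   # n RGB images of shape (H, W, 3)
    captions: List[Caption]    # m closed captions


# Hierarchical knowledge structure gamma = (K_1, ..., K_H): one concept set per level.
knowledge_structure: List[Set[str]] = [
    {"math"},                              # K_1
    {"algebra", "geometry"},               # K_2
    {"linear equations", "triangles"},     # K_3
]

# Prediction target L = {l_1, ..., l_H}, with l_i a subset of K_i.
predicted_concepts: List[Set[str]] = [{"math"}, {"algebra"}, {"linear equations"}]
```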

3.2 Text-Visual Uniform Section Segmentation

Unlike previous works [31] that detect shot boundaries of visual frames and then use them to guide the segmentation of captions, we preprocess sequential frames and closed captions by exploiting the timecodes of captions. We first complement missing closed captions using ASR (Automatic Speech Recognition) tools. We observe that educational visual content within a section is progressive, and later frames tend to contain more information. Thus, inspired by Adaptive Block Matching (ABM) [28] and Dynamic Frame Skipping [18], we develop an efficient section segmentation strategy that fits educational videos well:

  1. Select the center frames of timecode gaps as section candidates.

  2. Merge sections: replace adjacent candidates within \(t_{min}\) by the latter ones.

  3. Calculate the difference matrix diff and the score \(\sigma \) of all adjacent candidate frames by pixel-wise value subtraction.

  4. Merge sections if the corresponding difference score \(\sigma \) is less than the threshold \(\theta _{min}\).

  5. Calculate the ABM scores \(\delta \) for adjacent candidates whose difference score \(\sigma \) is greater than the threshold \(\theta _{max}\).

  6. Select the top \(n_{sections}\) candidates by \(\delta \) as the keyframes representing each section, together with uniform pairs of captions and difference matrices between sections.

It is worth noting that the ABM score is calculated by dividing two frames into patches and measuring how well the patches of the latter frame cover those of the previous one. The difference score \(\sigma \) of the k-th candidate can be expressed as:

$$\begin{aligned} \sigma (k) = \frac{1}{WH}\sum _{i=0}^{W-1}\sum _{j=0}^{H-1}|f_{ij}^{k+1}-f_{ij}^k |, \end{aligned}$$
(1)

where \(f_{ij}^k\) denotes the scaled pixel value of the k-th candidate frame. As a result, each input video is split into a fixed number of sections. Each section comprises a keyframe and several uniform pairs of difference matrices and caption texts, which serve the modeling of fine-grained spotlight flow within a section.
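A minimal sketch of the difference score in Eq. (1) and of the threshold-based merging in steps 3-6 follows, assuming candidate frames are given as equally sized arrays of scaled pixel values; the handling of candidates whose score falls between \(\theta_{min}\) and \(\theta_{max}\), and the abm_score helper, are hypothetical placeholders rather than the paper's exact procedure.

```python
import numpy as np


def difference_score(frame_k: np.ndarray, frame_k1: np.ndarray) -> float:
    """Eq. (1): mean absolute difference of scaled pixel values between
    the k-th and (k+1)-th candidate frames."""
    diff = np.abs(frame_k1.astype(np.float32) - frame_k.astype(np.float32))
    return float(diff.mean())


def select_keyframes(candidates, theta_min, theta_max, n_sections, abm_score):
    """Steps 3-6: drop near-duplicate candidates, score the rest, keep the top n_sections.
    `abm_score(prev, cur)` is assumed to implement the patch-covering ABM measure."""
    kept, scores = [candidates[0]], []
    for frame in candidates[1:]:
        sigma = difference_score(kept[-1], frame)
        if sigma < theta_min:
            # step 4: barely any change, merge into the previous section
            continue
        # step 5: only large changes are re-scored with ABM; others keep sigma (our assumption)
        delta = abm_score(kept[-1], frame) if sigma > theta_max else sigma
        kept.append(frame)
        scores.append((delta, len(kept) - 1))
    top = sorted(scores, key=lambda x: x[0], reverse=True)[:n_sections]   # step 6
    return [kept[i] for _, i in sorted(top, key=lambda x: x[1])]          # keep temporal order
```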

4 Spotlight Flow Network

In this section, we introduce the details of SFNet, as shown in Fig. 2. We discuss its two main parts, present the modeling of the Spotlight Flow Mechanism, and specify the loss function used to train the model.

Fig. 2. The SFNet framework.

4.1 Multimodal Representation Layer

In the first stage of SFNet, we aim to represent each section by encoding multimodal data and modeling the Spotlight Flow Mechanism, and then obtain a video-level representation. The input of each section is a keyframe and several uniform pairs of difference matrices and caption texts. We first utilize a variant of ResNet [25] to extract the keyframe feature \(r^f \in \mathbb {R}^{d_1}\). A base version of BERT is used to obtain sequential semantic vectors \(r^{cs} \in \mathbb {R}^{t \times d_1}\) for all captions within the section, where t denotes the number of diff-caption pairs.
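The per-section encoders can be sketched as follows, using torchvision's ResNet-34 and Hugging Face's bert-base-uncased for the backbones named in Sect. 5.2; the linear projections to \(d_1 = 256\) and the use of the BERT pooler output are our assumptions, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34
from transformers import BertModel


class SectionEncoders(nn.Module):
    def __init__(self, d1: int = 256):
        super().__init__()
        backbone = resnet34(weights=None)  # pretrained weights would be loaded in practice
        self.visual = nn.Sequential(*list(backbone.children())[:-1])  # global-pooled features
        self.visual_proj = nn.Linear(512, d1)                         # r^f in R^{d1}
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(768, d1)                           # r^{cs} in R^{t x d1}

    def forward(self, keyframe, caption_ids, caption_mask):
        # keyframe: (1, 3, H, W); caption_ids / caption_mask: (t, max_len)
        r_f = self.visual_proj(self.visual(keyframe).flatten(1))      # (1, d1)
        cls = self.bert(input_ids=caption_ids,
                        attention_mask=caption_mask).pooler_output    # (t, 768)
        r_cs = self.text_proj(cls)                                    # (t, d1)
        return r_f, r_cs
```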

Fig. 3. Multimodal Representation Layer.

Spotlight Flow Attention (SFA). We observe that lecturers tend to guide viewers to focus on certain visual regions. Content that periodically appears or is regularly referenced by underlines, circle drawings, etc., strongly indicates the correlation between different time periods and connects time with moving regions. Thus, SFA is designed to model this mechanism. Inspired by I3D [7], we inflate the feature maps from the middle of the backbone to obtain \(r^{mid} \in \mathbb {R}^{t \times d_2 \times w \times h}\). We resize the difference matrices with interpolation and apply element-wise multiplication as follows:

$$\begin{aligned} r^{flow}_{(i, j)} = r^{mid}_{(i, j)} \cdot diff_{(i, j)}, \end{aligned}$$
(2)

and by passing the result through the latter part of the feature extractor, we obtain the features of the moving regions \(r^{flow} \in \mathbb {R}^{t \times d_1}\). SFA can then be formulated as:

$$\begin{aligned} r^{c}_{att} = SFA(r^{flow}, W_{sf}, r^{cs}) = softmax(r^{flow} \cdot W_{sf}) r^{cs}, \end{aligned}$$
(3)

where \(W_{sf} \in \mathbb {R}^{t \times d_1}\) is a hidden weight matrix. Considering the associations between sequential captions, we utilize a Bi-LSTM, which is capable of learning dependencies across the sequence in both the forward and backward directions. We input \(r^{c}_{att}\) and \(r^{flow}\), which have the same size along the temporal dimension, and the final caption representation \(r^c\) is calculated by average pooling the hidden states.
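A minimal sketch of SFA (Eqs. 2-3) is shown below. We read Eq. (3) as a \(t \times t\) attention map \(softmax(r^{flow} W_{sf}^{T})\) applied to \(r^{cs}\); that reading, the assumption that t is fixed per section, concatenating \(r^{c}_{att}\) with \(r^{flow}\) as the Bi-LSTM input, and the `tail` argument standing in for the latter part of the visual backbone are all our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpotlightFlowAttention(nn.Module):
    def __init__(self, t: int, d1: int, hidden: int = 128):
        super().__init__()
        self.W_sf = nn.Parameter(torch.randn(t, d1) * 0.02)     # W_sf in R^{t x d1}
        self.bilstm = nn.LSTM(2 * d1, hidden, batch_first=True, bidirectional=True)

    def forward(self, r_mid, diff, r_cs, tail):
        # r_mid: (t, d2, w, h) inflated mid-level feature maps
        # diff:  (t, 1, w, h)  difference matrices resized to (w, h)
        # r_cs:  (t, d1)       sequential caption features
        # tail:  callable mapping (t, d2, w, h) -> (t, d1), the rest of the backbone
        r_flow = tail(r_mid * diff)                              # Eq. (2) + remaining layers
        attn = F.softmax(r_flow @ self.W_sf.t(), dim=-1)         # (t, t) attention map
        r_c_att = attn @ r_cs                                    # Eq. (3), (t, d1)
        seq = torch.cat([r_c_att, r_flow], dim=-1).unsqueeze(0)  # (1, t, 2*d1)
        h, _ = self.bilstm(seq)                                  # (1, t, 2*hidden)
        r_c = h.mean(dim=1).squeeze(0)                           # average-pool hidden states
        return r_c, h.squeeze(0)                                 # r^c and per-step states for CFA
```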

Caption Frame Attention (CFA). We propose CFA to capture the correlation between captions and the related parts of the visual content across time. We implement CFA by using the hidden states h of the Bi-LSTM as the query of the attention input:

$$\begin{aligned} r^f_{att} = avg( CFA(h, r^f, r^f) ) = avg\Big ( softmax\Big (\frac{h \cdot r^f}{\sqrt{d_k}}\Big )\, {r^f}\Big ), \end{aligned}$$
(4)

where avg() denotes the average pooling operation and \(d_k\) is the scaling factor. The representation of the section is then:

$$\begin{aligned} r^{sec} = r^c \oplus avg(r^{diffs}) \oplus r^f_{att}, \end{aligned}$$
(5)

and the final output of the MRL is calculated by feeding all section-level features into a video-level Bi-LSTM to model the correlations among sections.
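A minimal sketch of CFA (Eq. 4) and the section-level aggregation (Eq. 5) is given below. Treating \(r^f\) as a set of per-region keyframe features, projecting the Bi-LSTM states into that feature space, and the feature sizes fed to the video-level Bi-LSTM are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CaptionFrameAttention(nn.Module):
    def __init__(self, d_hidden: int, d1: int):
        super().__init__()
        # project Bi-LSTM states onto the frame-feature space (assumption)
        self.query = nn.Linear(d_hidden, d1)

    def forward(self, h, r_f):
        # h:   (t, d_hidden) Bi-LSTM hidden states (queries)
        # r_f: (m, d1)       keyframe features, one vector per spatial region (assumption)
        q = self.query(h)                                  # (t, d1)
        scores = q @ r_f.t() / (r_f.shape[-1] ** 0.5)      # scaled dot product, (t, m)
        attn = F.softmax(scores, dim=-1)
        return (attn @ r_f).mean(dim=0)                    # avg(...) in Eq. (4) -> (d1,)


def section_representation(r_c, r_diffs, r_f_att):
    # Eq. (5): concatenate caption, averaged difference, and attended frame features.
    return torch.cat([r_c, r_diffs.mean(dim=0), r_f_att], dim=-1)


# Video-level aggregation: feed all section features to a Bi-LSTM
# (input size depends on the concatenated section feature; 3 * 256 is illustrative).
video_bilstm = nn.LSTM(input_size=3 * 256, hidden_size=128,
                       batch_first=True, bidirectional=True)
```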

4.2 Hierarchical Multi-label Inter-level Constrained Classifier

Fig. 4. Inter-level constrained unit.

Given the multimodal representation of the video \(V \in \mathbb {R}^{n \times d}\), where n is the number of sections, a Hierarchical Multi-label Inter-level Constrained Classifier (HMICC) is proposed to predict knowledge concepts for educational videos, following the feed-forward manner of current hybrid methods. The network consists of several Inter-level Constrained Units (ICU), shown in Fig. 4. Each unit utilizes Section-Concept Attention (SCA) and an Inter-level Constrained Matrix (ICM) to model each level's dependencies and feeds the hidden information to the next unit. Specifically, \(S^i \in \mathbb {R}^{C^i \times d}\) denotes the hidden representation of the i-th level and is input to SCA together with V. We apply dot-product scores to measure the similarity between categories and video sections:

$$\begin{aligned} \begin{aligned} V_{att}&= softmax(S^i \cdot V) \cdot V, \\ r_{att}^v&= avg(V_{att}), \end{aligned} \end{aligned}$$
(6)

where we apply average pooling along the temporal dimension to obtain \(r_{att}^v\). Then we concatenate \(r_{att}^v\) and the previous hidden state \(h_{i-1}\) to obtain \(h_i\) by:

$$\begin{aligned} h_i = \varphi (W_h (r_{att}^v \oplus h_{i-1}) + b_h ), \end{aligned}$$
(7)

where \(\oplus \) denotes concatenation. Here we adopt the Inter-level Constrained Matrix \(ICM_i \in \mathbb {R}^{C^i \times C^{i+1}}\), with each \(icm_{jk}\) representing the influence of the j-th category on the k-th category of the next level. We initialize all ICMs with the conditional probabilities calculated from the training set. The product of the ICM and the previous level's prediction is combined with the hidden state to obtain the local prediction through a hidden layer, and the global output is obtained by passing the last hidden state through a fully-connected layer:

$$\begin{aligned} \begin{aligned} P_L^i&= \sigma (W_L ((P_L^{i-1} ICM_i) \oplus h_i) + b_L), \\ P_G&= (W_G \cdot h_H + b_G), \end{aligned} \end{aligned}$$
(8)

where \(W_h\), \(W_L\), \(W_G\) and \(b_h\), \(b_L\), \(b_G\) are weight matrices and bias vectors, respectively. Therefore, we can calculate the final prediction P with a parameter \(\beta \in [0, 1]\) that balances the local and global outputs:

$$\begin{aligned} P = \beta \cdot P_G + (1-\beta ) \cdot (P_L^1 \oplus P_L^2 \oplus \dots \oplus P_L^H). \end{aligned}$$
(9)
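A minimal sketch of one Inter-level Constrained Unit (Eqs. 6-8), together with the ICM initialization from training-set conditional probabilities, is shown below; the choice of \(\varphi\) = ReLU, the learned class representations \(S^i\), and the exact pooling axis are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterLevelConstrainedUnit(nn.Module):
    """One ICU implementing Eqs. (6)-(8); shapes follow our reading of the text."""

    def __init__(self, d, d_hidden, c_prev, c_cur, icm_init):
        super().__init__()
        self.S = nn.Parameter(torch.randn(c_cur, d) * 0.02)   # S^i: hidden level-i class vectors
        self.W_h = nn.Linear(d + d_hidden, d_hidden)          # Eq. (7)
        self.ICM = nn.Parameter(icm_init.clone())             # ICM_i in R^{c_prev x c_cur}
        self.W_L = nn.Linear(c_cur + d_hidden, c_cur)         # local head of Eq. (8)

    def forward(self, V, h_prev, p_prev):
        # V: (n, d) section features; h_prev: (d_hidden,); p_prev: (c_prev,)
        attn = F.softmax(self.S @ V.t(), dim=-1)              # Eq. (6): Section-Concept Attention
        r_att = (attn @ V).mean(dim=0)                        # average pooling -> (d,)
        h = torch.relu(self.W_h(torch.cat([r_att, h_prev])))  # Eq. (7)
        p_local = torch.sigmoid(
            self.W_L(torch.cat([p_prev @ self.ICM, h])))      # Eq. (8), local prediction
        return h, p_local


def icm_from_labels(Y_prev: torch.Tensor, Y_cur: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Initialize ICM_i with conditional probabilities P(child k | parent j) estimated
    from binary training-set label matrices of shapes (N, c_prev) and (N, c_cur)."""
    co = Y_prev.t().float() @ Y_cur.float()                   # co-occurrence counts
    return co / (Y_prev.float().sum(dim=0).unsqueeze(1) + eps)


# Eq. (9): the final prediction blends the global head with the concatenated local
# predictions, e.g. P = beta * P_G + (1 - beta) * torch.cat(p_locals).
```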

4.3 Training SFNet

In this section, we specify a hybrid loss function for training SFNet to learn both global and local information. We calculate the global loss \(\mathcal {L}_G\) and the local loss \(\mathcal {L}_L\) for each hierarchical level, which can be formulated as:

$$\begin{aligned} \mathcal {L}_G = \varepsilon (P_G, Y_G), \mathcal {L}_L = \sum _{h=1}^{H}\varepsilon (P_L^h, Y_L^h), \end{aligned}$$
(10)

where \(Y_G\) denotes the binary label vector for all categories of the knowledge structure and \(Y_L^h\) contains only the categories of the h-th level. We utilize the binary cross-entropy loss as \(\varepsilon (\hat{Y}, Y)\) and formulate the final loss function as:

$$\begin{aligned} \mathcal {L}(\Omega ) = \mathcal {L}_L + \mathcal {L}_G + \lambda {||\Omega ||}^2, \end{aligned}$$
(11)

where \(\Omega \) denotes the parameters of SFNet and \(\lambda \) is the hyper-parameter for L2 regularization. Thus, we can train SFNet by minimizing the loss function \(\mathcal {L}(\Omega )\).
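A minimal sketch of the hybrid objective in Eqs. (10)-(11) follows; treating all predictions as probabilities and implementing the L2 term explicitly (rather than via the optimizer's weight decay) are our simplifying assumptions.

```python
import torch.nn.functional as F


def sfnet_loss(p_global, y_global, p_locals, y_locals, parameters, lam=5e-5):
    """Hybrid loss of Eqs. (10)-(11); p_locals/y_locals are lists with one entry per level."""
    loss_g = F.binary_cross_entropy(p_global, y_global)                    # L_G
    loss_l = sum(F.binary_cross_entropy(p, y)                              # L_L, summed over levels
                 for p, y in zip(p_locals, y_locals))
    l2 = sum(p.pow(2).sum() for p in parameters)                           # ||Omega||^2
    return loss_l + loss_g + lam * l2
```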

5 Experiments

5.1 Data Description

To evaluate the performance of our framework, we construct a dataset by collecting 7,521 educational videos, together with their closed captions and hierarchical knowledge concepts, from Khan Academy.Footnote 1 The dataset involves a three-level hierarchical knowledge structure with 6, 42, and 351 concepts at each level, 399 in total. On average, a video is 436.4 s long and has 1,151 words of captions.

5.2 Baseline Approaches and Experimental Setup

We compare our proposed model with state-of-the-art approaches, including unimodal and multimodal ones. It is worth noting that all baseline models are pretrained on ImageNet, the Kinetics dataset, etc., according to their categories, and tuned to obtain their best results.

  • R3D [26] is a deep 3D convolutional network with residual connections across layers, which enables a very deep structure while retaining performance gains.

  • SlowFast [9] is a two-stream 3D CNN network that consists of two different paths that separately focus more on temporal and spatial information.

  • TimeSformer [4] is a video transformer network that uses frame patches with positional encoding as input and exploits divided spatial and temporal self-attention.

  • R3D+BERT is the combination of R3D and BERT. We leverage BERT to obtain caption features and fuse them with the visual features from frames.

  • HMCN-F [29] is a feed-forward network that models the top-down hierarchical relationship and optimizes both local and global performance with penalties of hierarchical violations.

We implement all methods using PyTorch. To train SFNet, we set \(n_{sections}\) to 8 and the maximum number of words per caption to 64. We use ResNet-34 and BERT-base as the feature extractor backbones and set the output dimension to 256. The hidden sizes of the Bi-LSTM and HMICC are 128. We use the Adam optimizer with an initial learning rate of 0.0005 and a cosine annealing scheduler that periodically decays it to 0.00005 every 60 epochs. We also set \(\beta = 0.5\), \(\lambda = 0.00005\), and the dropout rate to 0.5 to mitigate over-fitting. We use Precision, Recall, \(F1\)-score, and mAP (mean Average Precision) as criteria for performance comparison. Regardless of whether a model considers the knowledge hierarchy, we calculate the performance at each hierarchical level as well as globally to further compare the differences.
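The optimization setup roughly corresponds to the sketch below; SFNet, train_loader, compute_loss, and num_epochs are hypothetical placeholders for the model, data pipeline, loss of Eq. (11), and training length, and reading the schedule as a 60-epoch CosineAnnealingLR cycle with eta_min = 5e-5 is our interpretation.

```python
import torch

model = SFNet()                   # placeholder: the model described in Sect. 4
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=60, eta_min=5e-5)

for epoch in range(num_epochs):   # placeholder training length
    for batch in train_loader:    # placeholder data loader
        optimizer.zero_grad()
        loss = compute_loss(model, batch)   # hybrid loss of Eq. (11)
        loss.backward()
        optimizer.step()
    scheduler.step()              # anneal the learning rate from 5e-4 toward 5e-5
```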

Table 1. Performance comparison on the Khan Academy dataset. V and T denote the visual and textual modalities of the input data.
Fig. 5. Performance of SFNet and baseline models on different hierarchical levels.

5.3 Experimental Results

Performance Comparison. From the results shown in Table 1 and Fig. 5, we make several observations. First, models with textual input tend to outperform visual-only models. In educational videos, the visual content serves the lecturer's explanation; due to the complexity and variance of visual elements such as hand-drawn graphics, their semantics are harder to understand than textual content. This also indicates the significance of spotlight flow attention. Second, the performance clearly decreases at lower levels. Hierarchical structures have the natural property that higher levels have fewer categories and more data, which may explain the drop in performance. The results show that SFNet is more effective because it considers the inter-level associations.

Ablation Study. To further assess how each part of our model contributes to the performance, we remove each key module one at a time and construct several variants of SFNet. As shown in Table 2, all the key modules contribute to better prediction performance; a larger gap indicates a greater impact of the removed module. In addition, the variant without textual input shows the greatest performance drop, which again reflects the characteristics discussed above.

Table 2. Results of the ablation study. V and T represent the visual and textual inputs.

6 Conclusion

In this paper, we presented the Spotlight Flow Network for predicting knowledge concepts of educational videos. We first adopted an effective text-to-visual section segmentation strategy for educational videos. Then, using the different kinds of information paired with captions, we modeled the Spotlight Flow mechanism, in which lecturers guide viewers' attention and moving regions help build up space-time connections. We also designed the HMICC to predict hierarchical knowledge concepts with implicit progressive impact and explicit inter-level constraints.