1 Introduction

Student behavior reflects, to a certain extent, how much knowledge students acquire during class [1]. Therefore, in traditional education, teachers must constantly monitor student behavior while teaching in order to adjust the pace and methods of instruction. However, in actual classroom environments a single teacher often faces dozens of students or more, leaving the teacher too little attention to observe student behavior while teaching. Thus, an accurate and real-time method for detecting student behavior, used in place of the teacher's observation tasks, would allow teachers to focus more on the teaching itself.

In the current era of educational informatization and intelligence, classroom student behavior detection, as an emerging instructional aid, is increasingly gaining widespread attention from the academic community and educational practitioners. Traditional classroom behavior analysis primarily relies on manual observation, with a common practice being the analysis of student behavior through classroom video recordings [2]. Due to the large volume of videos, manual processing can lead to fatigue and low efficiency, consuming a significant amount of human resources [3]. Moreover, it cannot provide real-time feedback to teachers, which limits its impact on improving teaching effectiveness.

With the rapid development of deep learning technology, real-time and accurate detection of student behavior in the classroom has become feasible. Compared with traditional methods, deep learning-based object detection can automatically learn features from large volumes of video data, overcoming the limitations of manual feature extraction [4]. Such methods offer deeper insights into students' learning situations in the classroom while reducing the pressure of classroom supervision on teachers, thereby better enhancing the quality of courses.

Deep learning-based object detection methods are mainly categorized into two types. The first type comprises two-stage detection algorithms, in which models first generate proposal boxes and then classify them using deep convolutional networks; common models include R-CNN [5], Fast R-CNN [6], and Faster R-CNN [7]. Although two-stage algorithms are more accurate, they are slower, less suited to real-time processing, and their large model sizes make deployment challenging. The second type comprises regression-based one-stage detection algorithms, which predict category probabilities and location information simultaneously. These models balance accuracy and computational speed, and their smaller sizes facilitate deployment in practical applications. Common algorithms include YOLO [8], RetinaNet [9], and SSD [10], with the YOLO series being the most widely applied one-stage detection algorithm [11]; within the series, YOLOv8 offers the best balance between lightweight design and detection accuracy.

While the YOLO algorithm has found applications in various areas, deploying it for student behavior detection in classrooms still faces significant challenges. Due to camera resolution limitations, the number of pixels representing each student's body in the image is typically very low. Moreover, the significant size discrepancy between students sitting at the front and those at the back of the classroom leads to multi-scale targets [12]. Addressing these issues normally calls for a larger, more precise model, yet most schools lack the equipment needed to run such complex models, so a model that is both lightweight and capable is required.

This paper designs and proposes CSB-YOLO, a model based on an improved YOLOv8n and specifically tailored for real-time detection of student behaviors in classroom settings. CSB-YOLO requires minimal parameters and computation, and its low hardware demands and satisfactory accuracy make it highly deployable on the low-performance devices found in schools. The main contributions of this paper are as follows:

  1. This paper introduces the BiFPN structure [13] into the YOLOv8 model to enhance its capability to detect densely distributed small targets while reducing both the computational cost and the parameter count.

  2. We devise a novel Efficient Re-parameterized Detection Head (ERD Head) for YOLOv8, replacing the original detection head. This modification significantly reduces model complexity and accelerates inference, thereby enhancing real-time performance. In addition, we enhance the C2f module with SCConv [14] to compensate for the accuracy loss caused by the lightweight detection head.

  3. We employ LAMP pruning [15] to optimize the model's structure, significantly reducing both the parameter count and the computational load and making the model more suitable for deployment on low-performance devices. To prevent the accuracy drop caused by pruning, the pruned model undergoes BCKD knowledge distillation [16]. This combination of pruning and knowledge distillation achieves nearly lossless lightweighting of the model.

2 Related works

Behavior detection of human targets remains a hot topic in the field of object detection, but behavior detection in classroom settings poses many challenges, such as detecting multi-scale dense targets and the need for lightweight models in practical applications. This section will explore and analyze solutions to these challenges.

Zhang et al. [17] established a dataset for the behavior of students raising hands in classroom environments and discovered that information loss occurred due to the reduction of channel numbers during the construction of the feature pyramid [18]. By applying Spatial Context Augmentation (SCA) and multi-branch feature fusion modules, the precision of hand-raising detection was enhanced. However, the complexity of the network structure may compromise real-time performance.

Wang et al. [19] introduced a method to detect yawning in classroom environments, integrating the feature pyramid within R-FCN [20] to tackle issues such as occlusions and low-resolution facial recognition, and employed channel pruning to diminish both the model’s parameter count and computational overhead. While pruning drastically decreases the number of parameters, it inevitably leads to a decline in precision as the pruning ratio escalates, thereby raising a pivotal challenge concerning how to maintain or improve precision amidst the pruning process.

Cheng et al. [21] introduced the concept of a cross-stage local network at the end of the YOLO-v4 [22] network, embedding the Embedding Connection (EC) component to develop an improved YOLO-v4 network for detecting teacher and student behaviors. Bao et al. [23] improved the model based on YOLOv5 by adding a feature fusion layer and incorporating the ghost module [24] to replace standard convolutions, thus enhancing the model’s capability to detect behaviors in the classroom. Wang et al. [25] introduced the CBAM [26] attention mechanism to the YOLOv7 [27] foundation, effectively capturing contextual features and enhancing the network’s feature detection capability, allowing accurate detection of multiple students’ learning behaviors. Cheng et al. [28] improved the C2f structure with Res2Net [29] on the YOLOv8 base, enhancing the network’s ability to extract multi-scale features. They also introduced the EMA [30] attention mechanism in the backbone to address occlusion issues in classroom settings. While the introduction of attention mechanisms significantly enhances the model’s feature learning capability and accuracy, it also significantly increases computational overhead, adversely affecting network lightweighting.

Recent studies, such as Liu et al. [31], have enhanced YOLOv8 by adding a small object detection layer equipped with a dedicated detection head specifically for small objects. Although this dedicated detection head does indeed improve the network’s ability to detect small objects, it inevitably complicates the network and the additional detection head can limit the network’s inference speed. Meanwhile, Xiao et al. [32] have improved network accuracy by incorporating the IMPDIoU loss function into YOLOv8, but this method does not enhance the network’s inference speed.

Analysis of the above research reveals that enhancing the network’s learning capability for multi-scale targets is crucial for addressing the difficulty of detecting multi-scale dense targets. Feature fusion is an important method to enhance multi-scale feature learning capabilities [33], and adding attention mechanisms can significantly improve detection accuracy. However, the increased computational overhead caused by complex network structures remains a problem to be solved. Network pruning is an effective lightweighting method that significantly reduces the network’s parameter count and computational load, but it requires a careful balance between pruning rate and accuracy.

3 CSB-YOLO detection model

Fig. 1 The architecture of the CSB-YOLO

3.1 YOLOv8

YOLOv8, developed by Ultralytics as an evolution of YOLOv5, is a single-stage object detection algorithm. It builds on the CSP gradient-splitting idea and uses the SPPF module at the end of the backbone; notably, it replaces its predecessor's C3 structure with the more gradient-rich C2f structure in both the backbone and the neck. In the head section, it adopts a decoupled structure that separates the classification and regression branches. YOLOv8 also shifts from the traditional anchor-based approach to an anchor-free design. To further exploit its lightweight nature for deployment on the low-performance devices found in real-world classrooms, this paper builds upon the smallest variant, YOLOv8n.

3.2 Overview of CSB-YOLO

To meet the demands of real-time student behavior detection in classroom scenarios, we devised the CSB-YOLO model illustrated in Fig. 1. First, to handle the densely distributed student targets in classrooms, we modified the Neck of YOLOv8 to incorporate the BiFPN structure. We then engineered an Efficient Re-parameterized Detection Head to replace the original detection head of YOLOv8; this not only reduces network complexity but also accelerates inference, enhancing real-time performance. Furthermore, we devised a C2f_SCConv structure to locate student targets more precisely and compensate for the accuracy loss caused by the lightweight detection head. To streamline the network further, we applied LAMP pruning, significantly reducing the parameter count and computational load and thereby easing deployment. Finally, we applied the BCKD distillation strategy to the pruned model so that it remains lightweight while maintaining high detection accuracy.

3.3 BiFPN

In the Neck section, YOLOv8 uses the PANet [34] structure for feature fusion. Compared with the traditional FPN, PANet performs bidirectional feature fusion and retains multi-scale feature information, but this fusion mechanism relies on numerous nodes, increasing the network's computational and parameter requirements. To reduce these demands without compromising accuracy, we incorporate the BiFPN [13] structure into YOLOv8.

BiFPN is an advanced version of the FPN architecture that establishes cross-scale connections through bidirectional channels, so that each layer receives features from both higher and lower levels. In contrast to PANet, BiFPN removes nodes with only one input and employs weighted feature fusion. Additionally, to combine more features, an extra pathway is introduced that links the input and output nodes of the same level. This design allows the network to better balance semantic and spatial information across layers, preserving shallow detail information without sacrificing deep semantic information. Thanks to the reduction in nodes, BiFPN significantly decreases both the computational and parameter requirements of the network, while its efficient multi-scale feature fusion maintains accuracy. Figure 2 illustrates the three Neck structures.
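As an illustration of the weighted fusion idea, the following is a minimal PyTorch sketch of fast normalized fusion over feature maps that have already been resized to a common shape; the class name WeightedFusion and the tensor shapes are illustrative assumptions, not taken from the official BiFPN or YOLOv8 code.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion as used in BiFPN: learnable, non-negative
    weights decide how much each input scale contributes."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        # feats: list of tensors with identical shape (already resized/aligned)
        w = torch.relu(self.weights)          # keep weights non-negative
        w = w / (w.sum() + self.eps)          # normalize so the weights sum to ~1
        return sum(wi * fi for wi, fi in zip(w, feats))

# Usage: fuse a top-down feature with a same-level lateral feature.
fuse = WeightedFusion(num_inputs=2)
p4_td = fuse([torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)])
```

Because the fusion weights are learned scalars rather than extra convolutional nodes, such a fusion node adds almost no parameters or computation.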

Fig. 2 The structures of three different Necks

3.4 Efficient re-parameterized detection head

YOLOv8 has three detection heads, each containing two branches with two 3\(\times\)3 convolutions apiece for feature extraction, giving a total of twelve 3\(\times\)3 convolutions in the head section of the network. Although this configuration improves accuracy to some extent, the extensive use of convolutional kernels increases the network's parameter count and slows down inference.

Drawing inspiration from the parameter-sharing approach employed in the detection heads of RetinaNet [9], we redesigned the detection head of YOLOv8. The redesign consolidates the four 3\(\times\)3 convolutions used for feature extraction along the two branches of the original detection head into two, so that classification and box regression share these two 3\(\times\)3 convolutions. This modification reduces the complexity of the head section and increases inference speed, but inevitably results in a slight reduction in accuracy. To minimize this accuracy loss while simplifying the network, we introduce the Diverse Branch Block (DBB) [35] in place of the original convolutions.

Fig. 3 The basic structure of DBB

DBB is a universal module that adds no inference-time cost; it applies re-parameterization techniques, building upon ACNet [36] and RepVGG [37] by exploring more equivalent transformations. DBB takes cues from the Inception [38,39,40,41] structure, enriching the feature space of the convolutional block with a multi-branch architecture. During the inference phase, the multiple branches are re-parameterized and merged into a single main branch, accelerating inference while maintaining precision.

Transform I - Convolution with Batch Normalization (Conv-BN): A convolution layer is often equipped with a BN layer which performs channel-wise normalization and scaling.

$$\begin{aligned} O_{j}=\frac{(I * F)_{j}-\mu _{j}}{\sigma _{j}} \gamma _{j}+\beta _{j} \end{aligned}$$
(1)

where \(*\) denotes convolution, I is the input, F is the filter, \(\mu _j\) and \(\sigma _j\) are the channel-wise mean and standard deviation used by batch normalization, and \(\gamma _j\) and \(\beta _j\) are the scale and shift parameters. This transformation fuses the batch normalization parameters into the convolution filter for inference.
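As a concrete illustration of Transform I, the sketch below folds a BatchNorm2d layer into the preceding convolution following Eq. (1); fuse_conv_bn is an illustrative helper name, not part of the DBB reference implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution (Transform I).
    The returned convolution produces the same output as conv followed by bn."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)      # sigma_j
    scale = bn.weight / std                        # gamma_j / sigma_j
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```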

Transform II - Addition of Branch Outputs: The additivity property of convolution allows the merging of outputs from multiple convolution layers with the same configuration by simply adding their weights and biases:

$$\begin{aligned} F^{\prime }=\sum _{i} F^{(i)}, \quad b^{\prime }=\sum _{i} b^{(i)} \end{aligned}$$
(2)

\(F^{(i)}\) and \(b^{(i)}\) represent the convolution kernels and biases of the ith branch, respectively. \(F^{\prime }\) and \(b^{\prime }\) are the combined kernel and bias.

Transform III - Sequential Convolutions: A sequence of convolutions, typically a 1\(\times\)1 convolution followed by a K\(\times\)K convolution, can be merged into one effective convolution. The transformation rearranges and combines the weights of the sequential layers into a single layer that encapsulates their collective effect:

$$\begin{aligned} F^{\prime }=F^{(1)} * F^{(2)}, \quad b^{\prime }=F^{(1)} * b^{(2)}+b^{(1)} \end{aligned}$$
(3)

This is especially useful for reducing depth and computational complexity in the network.
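The following sketch shows how Transform III can be realized for a 1\(\times\)1 convolution followed by a K\(\times\)K convolution; the function name and tensor layouts are illustrative assumptions, and the merge is exact only when the K\(\times\)K convolution applies no zero padding (the full DBB formulation handles the padded border case separately).

```python
import torch

@torch.no_grad()
def merge_1x1_kxk(f1, b1, f2, b2):
    """Merge a 1x1 convolution (f1: DxCx1x1, bias b1) followed by a KxK
    convolution (f2: ExDxKxK, bias b2) into a single KxK convolution."""
    # Merged kernel: F'[e, c] = sum_d f1[d, c] * f2[e, d]
    f_merged = torch.einsum('dc,edhw->echw', f1[:, :, 0, 0], f2)       # ExCxKxK
    # Merged bias: b1 passes through the spatial sum of f2, then add b2
    b_merged = torch.einsum('d,ed->e', b1, f2.sum(dim=(2, 3))) + b2
    return f_merged, b_merged
```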

Transform IV - Depth Concatenation: Depth concatenated outputs from different branches are merged into a single convolution layer:

$$\begin{aligned} F^{\prime }={\text {Concat}}\left( F^{(1)}, F^{(2)}\right) , \quad b^{\prime }={\text {Concat}}\left( b^{(1)}, b^{(2)}\right) \end{aligned}$$
(4)

\({\text {Concat}}\) denotes the concatenation operation along the channel dimension, allowing multiple branches to combine into a single convolution operation.

Transform V - Convolution for Average Pooling: An average pooling operation is modeled as a convolution with a uniform kernel:

$$\begin{aligned} F_{d, c,:,:}^{\prime }=\left\{ \begin{array}{ll} \frac{1}{K^{2}} &{} \text{ if } d=c \\ 0 &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(5)

d and c are indices for the output and input channels, respectively; K is the size of the pooling (or convolution) window; \(F_{d, c,:,:}^{\prime }\) defines each element of the convolution kernel that implements the pooling.

Transform VI - Handling Multi-Scale Convolutions: Convolutions with different kernel sizes are unified into a larger convolution operation through appropriate padding:

$$\begin{aligned} F_{\text{ padded } }^{\prime }= \text{ Zero-Padding } \left( F_{\text{ small } }\right) \end{aligned}$$
(6)

\(F_{\text{ small } }\) represents a smaller kernel, which is padded with zeros to match the size of the largest kernel in the transformation, referred to here as \(F_{\text{ padded } }^{\prime }\).

A complete DBB block, as depicted in Fig. 3, consists of four branches. Using the aforementioned six parameter transformation methods, it is possible to convert complex multi-branch structures into standard convolutions and reuse the weights obtained during training. DBB facilitates the separation of training and inference phases: it employs a more complex network structure during training to improve network accuracy and undergoes equivalent transformations during inference to accelerate the inference process.
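For completeness, the sketch below combines Transforms VI and II to merge a parallel 1\(\times\)1 branch into a 3\(\times\)3 branch: the smaller kernel is zero-padded to 3\(\times\)3 and then added to the larger one. This is an illustrative fragment under those assumptions, not the full DBB merging routine.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def merge_parallel_branches(k3x3, b3x3, k1x1, b1x1):
    """Merge a parallel 1x1 branch into a 3x3 branch (Transforms VI then II):
    zero-pad the 1x1 kernel to 3x3, then add kernels and biases."""
    k1x1_padded = F.pad(k1x1, [1, 1, 1, 1])   # (O, I, 1, 1) -> (O, I, 3, 3)
    return k3x3 + k1x1_padded, b3x3 + b1x1
```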

Fig. 4 A structural comparison between ERD Head and the detection head of YOLOv8

The detection head designed in this paper, as illustrated in Fig. 4, replaces the original four convolutions of the detection head with two DBB modules. During the inference phase, these two DBB modules are equivalently transformed into two standard convolution modules. Thanks to the reduction in the number of convolutions, both the network’s parameter count and computational load significantly decrease. This will be validated in the experimental section.
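The sketch below conveys the structure of one scale of the ERD Head under simplifying assumptions: the two shared DBB modules are represented by plain 3\(\times\)3 Conv-BN-SiLU blocks for readability, and the class count and reg_max value are illustrative rather than taken from our implementation.

```python
import torch
import torch.nn as nn

class ERDHeadBranch(nn.Module):
    """Simplified sketch of one scale of the ERD Head: two shared 3x3 blocks
    (DBB modules in the actual design, plain Conv-BN-SiLU here) feed both the
    classification and the box-regression 1x1 outputs."""

    def __init__(self, in_ch: int, num_classes: int = 3, reg_max: int = 16):
        super().__init__()
        def conv_block(c):
            return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())
        self.shared = nn.Sequential(conv_block(in_ch), conv_block(in_ch))
        self.cls_out = nn.Conv2d(in_ch, num_classes, 1)   # class logits
        self.box_out = nn.Conv2d(in_ch, 4 * reg_max, 1)   # box distribution

    def forward(self, x):
        feat = self.shared(x)          # features shared by both tasks
        return self.cls_out(feat), self.box_out(feat)
```

Sharing the two feature-extraction blocks between classification and regression is what removes half of the 3\(\times\)3 convolutions compared with the decoupled YOLOv8 head.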

3.5 C2f_SCConv

In classroom environments, student positions are often densely distributed, leading to frequent occurrences of overlap and occlusion. The lightweight detection head inevitably reduces the network’s ability to detect complex human targets. To address this issue, we redesigned the C2f module in the YOLOv8 structure using SCConv [14], increasing the receptive field of the C2f module to compensate for the accuracy loss caused by the lightweight detection head.

The core idea of SCConv is to enhance the fundamental convolutional feature transformation process of CNNs without modifying the model architecture. It essentially employs grouped convolutions for multi-scale feature extraction, dividing them into two groups along the channel dimension. One pathway conducts regular convolutional feature extraction, while the other pathway utilizes downsampling operations to enlarge the network’s receptive field. This enables each spatial location to conduct self-calibrated operations by integrating information from two different spatial scales.

Fig. 5 A schematic illustration of self-calibrated convolutions

The workflow, as depicted in Fig. 5, involves input and output channels both of size C, with a given set of filters K shaped as \(\left( C, C, k_h, k_w\right)\), where \(k_h\) and \(k_w\) represent the spatial height and width, respectively. Initially, the filters are split into four groups \(\left\{ K_i\right\} _{i=1}^4\), each with a shape of \(\left( \frac{C}{2}, \frac{C}{2}, k_h, k_w\right)\). Subsequently, the input X is evenly divided into two parts \(\left\{ X_1, X_2\right\}\). The self-calibration operation is performed on \(X_1\) using filters \(\left\{ K_2, K_3, K_4\right\}\), yielding \(Y_1\). In the second pathway, a simple convolution operation is executed: \(Y_2=F_1\left( X_2\right) =X_2 * K_1\), which preserves the original spatial context information. Finally, \(\left\{ Y_1, Y_2\right\}\) are concatenated to form the final output Y.

In detail, the Self-Calibrated operation starts with applying average pooling to the given input \(X_1\), using a kernel size of \(r \times r\) and a stride of r, denoted as:

$$\begin{aligned} T_1=AvgPool_r\left( X_1\right) \end{aligned}$$
(7)

Next, \(T_1\) undergoes a feature transformation using filter \(K_2\):

$$\begin{aligned} X_1^{\prime }=Up\left( F_2\left( T_1\right) \right) =Up\left( T_1 * K_2\right) \end{aligned}$$
(8)

Here, \(Up(\cdot )\) represents the linear interpolation operation, mapping the intermediate quantity from the smaller scale space back to the original feature space. The self-calibration operation can then be represented as:

$$\begin{aligned} Y_1^{\prime }=F_3\left( X_1\right) \cdot \sigma \left( X_1+X_1^{\prime }\right) \end{aligned}$$
(9)

where \(F_3\left( X_1\right) =X_1 * K_3\), \(\sigma\) denotes the sigmoid function, and the symbol \(\cdot\) indicates element-wise multiplication. \(X_1^{\prime }\) serves as a residual term to establish the weights for self-calibration. The final output after self-calibration can be written as:

$$\begin{aligned} Y_1=F_4\left( Y_1^{\prime }\right) =Y_1^{\prime } * K_4 \end{aligned}$$
(10)
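The self-calibration pathway of Eqs. (7)-(10) can be sketched in PyTorch as follows; the module name, the 3\(\times\)3 kernel sizes, and the pooling ratio r = 4 are illustrative assumptions rather than values taken from our implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfCalibratedBranch(nn.Module):
    """Sketch of the self-calibration pathway (Eqs. 7-10) acting on X1 with
    half the channels; k2, k3, k4 correspond to the grouped filters K2-K4."""

    def __init__(self, ch: int, pooling_r: int = 4):
        super().__init__()
        self.k2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.k3 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.k4 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.pooling_r = pooling_r

    def forward(self, x1):
        t1 = F.avg_pool2d(x1, self.pooling_r, stride=self.pooling_r)       # Eq. (7)
        x1p = F.interpolate(self.k2(t1), size=x1.shape[2:],
                            mode='bilinear', align_corners=False)           # Eq. (8)
        y1p = self.k3(x1) * torch.sigmoid(x1 + x1p)                          # Eq. (9)
        return self.k4(y1p)                                                  # Eq. (10)
```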
Fig. 6 The structure diagram of the C2f_SCConv module

In this study, the C2f structure, improved by SCConv, is illustrated in Fig. 6. We enhanced the BottleNeck by replacing the convolution of the second CBS with SCConv, resulting in a structure we denote as BottleNeck_SCC. Subsequently, stacking n groups of BottleNeck_SCC forms the C2f_SCConv structure. Owing to SCConv’s larger receptive field, C2f_SCConv facilitates more precise localization of student targets within the classroom environment.

3.6 LAMP pruning

Although the enhancements described above already give the network a certain level of real-time capability, further lightweighting is still necessary, especially considering the device performance available in classroom scenarios. Complex neural network models often contain redundancies after training. These relatively unimportant redundancies do not substantially improve accuracy, yet they inflate the parameter count and slow down inference.

We adopted a pruning method based on LAMP (Layer-Adaptive Magnitude-based Pruning) scores [15]. LAMP is designed to automatically select the optimal level of sparsity among layers in a neural network to achieve the best balance between model performance and sparsity. Using LAMP scores, an adaptive, global pruning strategy can be implemented, eliminating the need for manual adjustment of hyperparameters.

The LAMP process is as follows. Consider a feedforward neural network of depth d in which each fully connected or convolutional layer is associated with a weight tensor \(w^{(1)}, \ldots , w^{(d)}\). For fully connected layers the corresponding weight tensor is a two-dimensional matrix; for 2D convolutional layers it is four-dimensional.

To define LAMP scores uniformly for fully connected and convolutional layers, each weight tensor is flattened into a one-dimensional vector W. Under a given index mapping, the weights are sorted in ascending order of magnitude, so that for \(u<v\) it always holds that \(|W[u]| \le |W[v]|\), where \(W[u]\) denotes the element of W at index u. The LAMP score of the u-th weight index of W is:

$$\begin{aligned} score(u ; W):=\frac{(W[u])^2}{\sum _{v \ge u}(W[v])^2} \end{aligned}$$
(11)

The LAMP score evaluates the relative importance of a target connection amongst all the surviving connections within the same layer. After calculating the LAMP scores, the connections with the lowest scores are globally pruned until the required global sparsity constraint is achieved. It is important to note that within each layer, there is a single connection with a LAMP score of 1, which is the maximum possible LAMP score. This ensures that at least one connection is preserved in every layer. This process essentially constitutes minimum magnitude pruning with layer-wise sparsity levels chosen automatically, obviating the need for intricate parameter settings.
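A minimal sketch of LAMP scoring and global threshold selection is given below, assuming unstructured (weight-level) pruning for clarity; the actual pipeline in this work prunes channels with an off-the-shelf toolchain, so the function names and the per-weight masks here are illustrative only.

```python
import torch

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """Compute LAMP scores (Eq. 11) for one layer's weight tensor: each squared
    weight is divided by the sum of squares of all weights not smaller than it."""
    w2 = weight.flatten().pow(2)
    order = torch.argsort(w2)                       # ascending |W[u]|
    sorted_w2 = w2[order]
    # Denominator: sum of squares of this weight and all larger ones.
    suffix_sums = torch.flip(torch.cumsum(torch.flip(sorted_w2, [0]), 0), [0])
    scores = torch.empty_like(w2)
    scores[order] = sorted_w2 / suffix_sums
    return scores.view_as(weight)

def global_prune_masks(weights: list, sparsity: float) -> list:
    """Globally remove the lowest-scoring connections until the target sparsity
    is reached; the top-scoring weight per layer (score 1) always survives."""
    all_scores = torch.cat([lamp_scores(w).flatten() for w in weights])
    k = int(sparsity * all_scores.numel())
    threshold = torch.kthvalue(all_scores, k).values if k > 0 else -1.0
    return [lamp_scores(w) > threshold for w in weights]
```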

3.7 BCKD knowledge distillation

BCKD [16] is a new knowledge distillation method tailored for dense object detection tasks, designed to address the inefficiency of traditional classification distillation in such contexts. In dense object detection, the pronounced imbalance among foreground categories leads to traditional softmax-based knowledge distillation methods overlooking the absolute classification scores of each category. This oversight can result in the optimal solution for the distillation loss function not necessarily ensuring the best classification performance in the student model. The fundamental principle of BCKD is to reframe the multi-class classification challenge into multiple binary classification tasks, applying knowledge distillation to each binary classification task individually.

Fig. 7 The workflow of BCKD knowledge distillation

The distillation process of BCKD is illustrated in Fig. 7. It incorporates two novel distillation loss functions specifically designed for object detection tasks: (i) Binary classification distillation loss, \(L_{cls}^{dis}\), which represents the classification logits as multiple binary maps and extracts classification knowledge through a distillation loss resembling binary cross-entropy; (ii) IoU-based localization distillation loss, \(L_{loc}^{dis}\), which transfers localization knowledge from the teacher model to the student model by calculating the IoU values between the bounding boxes predicted by the two models and employing IoU loss.

The Binary Classification Distillation Loss aims to address the severe imbalance between foreground and background categories in dense object detection. This approach tackles the challenge by transforming the multi-class problem into multiple binary classification tasks.

Classification Scores: For each location i and category j, the classification logit \(l_{i j}\) is transformed into a classification score \(p_{i j}\) using the sigmoid function:

$$\begin{aligned} p_{i j}=\frac{1}{1+\textrm{e}^{-l_{i j}}} \end{aligned}$$
(12)

Binary Cross-Entropy Loss: For each sample x, the classification loss \(L_{c l s}(x)\) is computed as:

$$\begin{aligned} L_{c l s}(x)=\sum _{i=1}^n \sum _{j=1}^K L_{C E}\left( p_{i j}, y_{i j}\right) \end{aligned}$$
(13)

where \(L_{C E}\left( p_{i j}, y_{i j}\right)\) is the binary cross-entropy loss, \(y_{i j}\) is the true label, n is the number of samples, and K is the number of categories.

Distillation Loss Weighting: Given the classification scores \(p_{i j}^t\) from the teacher model and \(p_{i j}^S\) from the student model, the binary classification distillation loss \(L_{c l s}^{d i s}(x)\) is:

$$\begin{aligned} L_{c l s}^{d i s}(x)=\sum _{i=1}^n \sum _{j=1}^K w_{i j} \cdot L_{B C E}\left( p_{i j}^s, p_{i j}^t\right) \end{aligned}$$
(14)

where \(w_{i j}\) are weights determined based on the score differences, intended to focus the learning on important samples.
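A simplified sketch of the binary classification distillation term follows: each (location, category) logit pair is treated as an independent binary task, with the teacher scores acting as soft targets. The weighting by the absolute teacher-student score gap is a simplifying assumption; the exact weight definition used by BCKD differs.

```python
import torch
import torch.nn.functional as F

def bckd_cls_loss(student_logits, teacher_logits):
    """Sketch of the binary classification distillation loss (Eq. 14) over
    logits of shape (n, K): n locations, K categories."""
    p_s = torch.sigmoid(student_logits)      # Eq. (12), student scores
    p_t = torch.sigmoid(teacher_logits)      # teacher scores used as soft targets
    w = (p_t - p_s).abs()                    # assumed weight: focus on large gaps
    bce = F.binary_cross_entropy(p_s, p_t, reduction='none')
    return (w * bce).sum()
```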

The IoU-based Localization Distillation Loss is designed to enhance localization performance by calculating the IoU between bounding boxes predicted by the teacher and student models.

IoU Calculation: Given the bounding boxes \(b_i^t\) and \(b_i^s\) predicted by the teacher and student models, respectively, their IoU \(u_i^{\prime }\) is computed as:

$$\begin{aligned} u_i^{\prime }=IoU\left( b_i^t, b_i^S\right) \end{aligned}$$
(15)

Localization Distillation Loss: The IoU-based localization distillation loss \(L_{l o c}^{d i s}(x)\) is formulated as:

$$\begin{aligned} L_{l o c}^{d i s}(x)=\sum _{i=1}^n \max _j \left( \omega _{i j}\right) \cdot \left( 1-u_i^{\prime }\right) \end{aligned}$$
(16)

where \(\max _j \left( \omega _{i j}\right)\) represents the maximum weight across categories at location i, emphasizing crucial localization information.

Total Distillation Loss: The total distillation loss is a linear combination of the binary classification distillation loss and the IoU-based localization distillation loss, aiming to optimize both classification and localization tasks concurrently. The total distillation loss \(L_{total}^{dis}(x)\) is defined as:

$$\begin{aligned} L_{total}^{dis}(x)=\alpha _1 \cdot L_{cls}^{dis}(x)+\alpha _2 \cdot L_{loc}^{dis}(x) \end{aligned}$$
(17)

where \(\alpha _1\) and \(\alpha _2\) are hyperparameters that adjust the relative importance of classification and localization losses in the total loss.
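The localization term and the combined loss of Eqs. (15)-(17) can be sketched as follows, assuming matched teacher/student boxes in xyxy format and a per-location weight matrix w of shape (n, K); these assumptions and the helper names are illustrative.

```python
import torch

def iou_xyxy(a, b, eps=1e-7):
    """Pairwise IoU between matched boxes a and b, both of shape (n, 4), xyxy format."""
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2] - a[:, 0]).clamp(min=0) * (a[:, 3] - a[:, 1]).clamp(min=0)
    area_b = (b[:, 2] - b[:, 0]).clamp(min=0) * (b[:, 3] - b[:, 1]).clamp(min=0)
    return inter / (area_a + area_b - inter + eps)

def bckd_total_loss(cls_dis, student_boxes, teacher_boxes, w, a1=1.0, a2=1.0):
    """Sketch of Eqs. (15)-(17): IoU-based localization distillation weighted by
    the per-location maximum category weight, combined with the classification
    distillation term; a1/a2 mirror alpha_1/alpha_2."""
    u = iou_xyxy(teacher_boxes, student_boxes)           # Eq. (15)
    loc_dis = (w.max(dim=1).values * (1.0 - u)).sum()    # Eq. (16)
    return a1 * cls_dis + a2 * loc_dis                   # Eq. (17)
```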

The design of these loss functions takes into account the foreground-background class imbalance inherent in dense object detection tasks, as well as the significance of localization accuracy. Through these meticulously crafted loss functions, effective training and optimization of the student model are achieved.

4 Experimental results and analysis

4.1 Experimental dataset

Fig. 8 Number of labels for each category in the SCB-Dataset3-S dataset

Leveraging deep learning for the automatic detection of student behavior is a critical strategy for enhancing teaching effectiveness. Nonetheless, the lack of publicly available datasets on student behavior presents a significant challenge to researchers in this domain. The dataset employed in this study is the SCB-Dataset3 (Student Classroom Behavior dataset) [42], which reflects real classroom scenarios. SCB-Dataset3 includes two subsets: SCB-Dataset3-S, consisting of classroom behavior data from elementary and middle schools, and SCB-Dataset3-U, comprising university classroom behavior data.

Fig. 9 Sample display of the SCB-Dataset3-S dataset

Our primary focus is on SCB-Dataset3-S, which is utilized as the training and validation dataset. The SCB-Dataset3-S dataset consists of 5015 images and 25,810 annotations, categorized into three types: hand raising, reading, and writing. Figure 8 displays the number of instances for each category, while Fig. 9 presents example images from the dataset. Furthermore, relying solely on a single dataset may not provide a comprehensive evaluation of a model’s performance. Therefore, we also conduct transfer training on the relatively smaller SCB-Dataset3-U dataset to test the model’s generalization ability and robustness.

4.2 Evaluation metrics

In the classroom scenarios covered by this study, targets of various scales coexist, which calls for a comprehensive evaluation of an object detection model's effectiveness. Consequently, the accuracy metrics employed in this paper are F1, mAP0.5, and mAP0.5:0.95. Alongside accuracy, we also evaluate the model's number of parameters, model file size, FLOPs, and FPS to analyze its capabilities thoroughly.

  (1) F1: the harmonic mean of Precision and Recall, providing a combined assessment of the two, with higher values being preferable (a small worked example follows this list). The calculation formulas are as follows:

    $$\begin{aligned} \text{ Precision }&=\frac{T P}{T P+F P} \end{aligned}$$
    (18)
    $$\begin{aligned} \text{ Recall }&=\frac{T P}{T P+F N} \end{aligned}$$
    (19)
    $$\begin{aligned} F 1&=2 \times \frac{ \text{ Precision } \times \text{ Recall } }{ \text{ Precision } + \text{ Recall } } \end{aligned}$$
    (20)
  (2) mAP: the mean of the average precision (AP) over all categories. The AP of each category is obtained by plotting its precision-recall curve and computing the area under that curve. The formula can be expressed as:

    $$\begin{aligned} m A P=\frac{1}{N} \sum _{i=1}^N A P_i \end{aligned}$$
    (21)
  (3) Number of parameters: evaluates the model's size and complexity and is obtained by summing the number of weight parameters in each layer. For lightweight models, a lower parameter count is preferred.

  (4) FLOPs: the total number of floating-point operations required for one forward pass of the model, an indicator of the model's computational complexity and efficiency.

  (5) Onnx file size: the size of the model file exported in Onnx format, which directly affects the model's deployability; the smaller the file, the lower the storage requirement on the device.

  (6) FPS (frames per second): the number of images the model can process per second. A higher FPS indicates better real-time performance.
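As a small worked example of Eqs. (18)-(20), the snippet below computes F1 from hypothetical detection counts; the numbers are illustrative only.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from detection counts (Eqs. 18-20)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: 180 true positives, 30 false positives, 40 missed detections.
print(f1_score(180, 30, 40))   # ~0.837
```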

4.3 Experimental environment and parameter settings

In this study, the PyTorch 1.13.1 deep learning framework was used to train each model for 300 epochs on the SCB-Dataset3-S dataset and for 100 epochs of transfer training on the SCB-Dataset3-U dataset. The momentum was set to 0.937, weight decay to 0.0005, the initial learning rate (lr0) to 0.01, the input image size to 640\(\times\)640, and the batch size to 8. The experimental platform ran Ubuntu 20.04 with an Intel(R) Xeon(R) Platinum 8255C CPU at 2.5 GHz, 32 GB of RAM, and an NVIDIA RTX 3080 GPU. Inference speed tests were additionally performed on a CPU-only device without a GPU, an Intel(R) Core i5-12400. The model pruning rate was set to 25%. For knowledge distillation, we use the unpruned CSB-YOLO as the teacher model and the pruned CSB-YOLO as the student model; distillation training runs for 300 epochs with a distillation loss ratio of 1.2.

4.4 Choice of pruning rate

Table 1 The impact of different pruning rates on the performance of CSB-YOLO
Fig. 10 Comparison of the number of network channels before and after pruning

We conducted tests on the SCB-Dataset3-S dataset to assess the impact of pruning rate on accuracy. Table 1 shows the effects of different LAMP pruning rates on the performance of CSB-YOLO, ranging from 10% to 80%. It was found that with the increase in pruning rate, the model’s number of parameters, computational load, and model file size all gradually decreased, and the FPS significantly improved. However, the accuracy also gradually declined with the increase in pruning rate. After exceeding a 30% pruning rate, the accuracy began to sharply decrease, and at an 80% pruning rate, the model became unusable. An important observation is that when the pruning rate was below 25%, the fine-tuned model even surpassed the unpruned model in mAP0.5. This is because the original model contained a substantial amount of redundancy, which not only slowed down the model’s computational speed but also increased the difficulty of training. Therefore, removing this redundancy could improve the model’s training accuracy.

The primary objective of the pruning phase in this study is to maximize model lightweighting without compromising accuracy. At a pruning rate of 25%, the FPS saw a significant increase compared to a 20% pruning rate, while the accuracy was higher than at a 30% pruning rate and very close to that of the original model, making 25% a more cost-effective choice. Subsequent experiments in this paper all employed a 25% pruning rate.

Since the detection head is responsible for outputting the detection results, its structure should remain unchanged; thus, it was not involved in the pruning process. We conducted a comparison of the number of channels in layers other than the model’s detection head, where layers 0 to 9 constitute the backbone of the model, and layers 10 to 27 form the Neck part. Figure 10 illustrates a significant reduction in the number of channels, with the total channels decreasing from 4176 to 2587.

4.5 Selection of the distillation loss ratio

Fig. 11 Accuracy changes under different loss ratios

To achieve enhanced precision while ensuring model lightness, it is imperative to perform BCKD distillation on the model post-pruning. The importance of each layer’s weights is discernible through a comparison of channel counts before and after pruning. As observed in Fig. 10, a significant portion of pruning occurs within the model’s backbone, indicating the presence of considerable redundancy in this section, which minimally impacts the overall network accuracy. Consequently, we focused on distilling layers 15, 18, 21, 24, and 27, where the change in channel numbers before and after pruning was minor, signifying their crucial role in maintaining network accuracy.

Additionally, we selected the unpruned CSB-YOLO as the teacher model because the structures of layers 15, 18, 21, 24, and 27 remained almost unchanged before and after pruning, making the teacher and student models structurally similar. This similarity in structure can reduce the complexity of the distillation training process. In addition, experiments were conducted on the SCB-Dataset3-S dataset to evaluate the impact of varying the loss ratio coefficient during the distillation process, with the distillation training spanning 300 epochs. As depicted in Fig. 11, the optimal performance across all accuracy metrics was observed when the loss ratio was set to 1.2. Consequently, a consistent loss ratio of 1.2 was employed in the subsequent experiments of this study.

4.6 Comparison of different distillation methods

Table 2 Comparison of different knowledge distillation methods

To validate the effectiveness of BCKD distillation, we compared it against three feature-based distillation methods, CWD, MGD, and Mimic, and two logit-based distillation methods, L1 and L2. Tests were conducted on the SCB-Dataset3-S dataset with the main parameters consistent with those described in Section 4.5. The results in Table 2 show that BCKD achieved the best performance among all compared methods, fully demonstrating the effectiveness of BCKD distillation in classroom environments with densely distributed targets.

4.7 Experiments and comparisons

Table 3 Performance comparison of different models on the SCB-Dataset3-S dataset
Table 4 Comparison of per-category accuracy for different models on the SCB-Dataset3-S dataset

To verify the effectiveness of the proposed CSB-YOLO, comparative experiments were conducted with commonly used object detection models on the SCB-Dataset3-S dataset. All models underwent 300 training epochs, with YOLOv8n as the baseline model. The results, as shown in Table 3, indicate that after model pruning and knowledge distillation, the CSB-YOLO model exhibited the lowest parameter count among all the models compared, at merely 23.9% of the baseline model’s parameters. Its GFLOPs were also reduced to 53% of the baseline. Furthermore, its accuracy surpassed all other smaller models tested; specifically, it achieved an mAP0.5 of 0.711, which is an increase of 0.8% over the baseline, and its mAP0.5:0.95 also improved by 0.3%. Although it is slightly less accurate than larger-scale models such as YOLOv7, YOLOv9-c, and YOLOv3, it significantly outperforms these models in terms of both parameter size and computational efficiency, key metrics for lightweight models.

Table 4 displays the accuracy across various categories, revealing that after pruning and distillation, CSB-YOLO outperforms the baseline in detecting behaviors such as raising hands and writing.

To observe the detection performance of the models more intuitively, we randomly selected image samples and visualized the detection results of YOLOv8n and CSB-YOLO. The results are shown in Fig. 12. It can be observed that from a rear perspective, the performance of both models is relatively similar. However, from a frontal perspective, CSB-YOLO exhibits superior detection effectiveness in areas where student targets are densely packed and also demonstrates commendable detection capabilities for smaller targets in the back rows of the classroom.

Fig. 12 Visualization of detection results for YOLOv8n and CSB-YOLO

Table 5 Performance comparison of different models on the SCB-Dataset3-U dataset

To assess the model's generalization across similar scenes, we used the models trained on the SCB-Dataset3-S dataset as pretrained weights for transfer learning on the SCB-Dataset3-U dataset, with all models trained for 100 epochs. The results, detailed in Table 5, show that CSB-YOLO is slightly below the baseline in mAP0.5. However, it achieves a significant 5.4% increase in F1 score, while its parameter count and computational load remain considerably lower than those of the baseline. Considering all three accuracy metrics together, CSB-YOLO still outperforms all other small models in the comparison, thoroughly demonstrating its generalization capability in classroom settings.

Table 6 Comparison of lightweight metrics among different models

In addition to the accuracy tests above, we conducted lightweight testing on a low-performance CPU device and a Raspberry Pi 5 to evaluate the feasibility of deploying CSB-YOLO on low-performance hardware. As shown in Table 6, after pruning and distillation CSB-YOLO has the smallest model file size among all compared models, at just 2.95MB. Moreover, when running inference on the CPU it achieves 37.17 FPS, slightly lower than YOLOv5n but higher than all other compared models, and it should be noted that CSB-YOLO's accuracy significantly surpasses that of YOLOv5n. An FPS of 37.17 is more than sufficient for real-time detection. CSB-YOLO therefore offers higher cost-effectiveness, making it well suited for deployment on the low-performance devices found in classrooms.

4.8 Experimental comparison of different detection heads

Table 7 Comparative experiments with different detection heads on the SCB-Dataset3-S dataset

We compared various detection head structures based on YOLOv8n, and the results are shown in Table 7. Our designed detection head structure exhibits the smallest parameter count, computational load, model file size, and the best real-time performance among the compared detection heads. The experiments provide ample evidence that our designed ERD Head achieves excellent lightweight performance.

4.9 Ablation experiment

Table 8 Module ablation experiment

To validate the effectiveness of each enhancement module in terms of both lightweight design and accuracy, we conducted ablation experiments on the baseline model YOLOv8n, incorporating the improvement modules one by one. All experiments were conducted on the SCB-Dataset3-S dataset, with each training session spanning 300 epochs. The results in Table 8 show that after integrating BiFPN, mAP0.5 decreased slightly by 0.2%, but the parameter count, computational load, and model file size were all substantially reduced and FPS increased by 4.08. This indicates that the BiFPN structure contributes significantly to the model's lightweight design without compromising accuracy. Building upon BiFPN, adding the ERD Head led to a further decrease in accuracy but also further reduced the parameter count, computational load, and model file size, while FPS increased to 31.43. This effectively demonstrates the ERD Head's ability to reduce model complexity and increase computational speed.

To counteract the precision loss caused by the ERD Head, we introduced the C2f_SCConv module, which successfully raised the mAP0.5 to the level of the baseline. As observed in Fig. 13, replacing C2f with C2f_SCConv allows the network to accurately concentrate attention on the students within the scene. This improvement is due to the larger receptive field of C2f_SCConv, which significantly bolsters the network’s ability to represent features of irregular human targets that occlude each other, enhancing detection accuracy in complex classroom environments.

After further applying LAMP pruning to the model, the parameter count was reduced to 0.72M, merely 23.9% of the baseline, without compromising accuracy. Concurrently, the FPS increased to 37.17. Ultimately, BCKD knowledge distillation was performed on the pruned model, boosting the mAP0.5 to 71.1%, which is a 0.8% increase from the baseline.

Fig. 13 Heatmap comparison between C2f_SCConv and C2f

5 Conclusion

This paper proposes a detection model named CSB-YOLO, which is specifically tailored for the detection of student behaviors in classroom settings. Designed to operate efficiently even in crowded classroom scenarios, this model has been optimized for lightweight deployment, allowing it to be easily and cost-effectively implemented on devices with limited computational capabilities typically found in classrooms. CSB-YOLO employs a BiFPN structure, replacing YOLOv8’s Neck structure, to reduce parameter size while enhancing feature fusion capabilities. This optimization enables the model to achieve accuracy levels comparable to YOLOv8n with fewer parameters. Additionally, we have designed a novel ERD Head, which significantly reduces the model’s parameter count and computational requirements while accelerating the model’s inference speed. To further address accuracy concerns stemming from lightweight design, we integrate SCConv into the C2f module, creating the C2f_SCConv structure, thus enhancing the model’s ability to represent human features. Employing LAMP pruning drastically reduces parameter size and computational requirements, resulting in an Onnx file size of only 2.95MB and an improved inference speed with an FPS of 37.17. Knowledge distillation further enhances the pruned model’s performance. Comparative testing on the SCB-Dataset3-S dataset demonstrates that CSB-YOLO achieves an mAP0.5 of 71.1%, marking a 0.8% increase over YOLOv8n, with parameter count and computational load only 23.9% and 53% of the latter, respectively. The lightweight design of CSB-YOLO ensures ease of deployment in real-world settings while maintaining high accuracy, meeting the demands for real-time student behavior detection in educational environments.