
1 Introduction

In the past decade, the Internet has witnessed an explosion of video, especially on the mobile Internet. Videos of all kinds have greatly enhanced user experiences. However, the diversity of video content poses a new challenge: how to identify the types of Internet video. Large video sites such as Netflix, YouTube, and Amazon need to classify their videos to provide high-quality video services. Schools need to ensure that students are exposed to wholesome videos. From the perspective of Internet management, it is necessary to separate illegal videos from the rest.

Traditional video classification generally falls into four approaches: text-based, audio-based, visual-based, and those that combine text, audio, and visual features [3, 7, 11, 15, 20, 21, 23]. Many standard classifiers, such as Bayesian classifiers and support vector machines (SVM), can be used for video classification. Gaussian mixture models (GMMs) and hidden Markov models (HMMs) have been particularly popular for video classification in the past few years [3]. Much progress has been made in recent years, and Convolutional Neural Networks (CNNs) have been proven to perform very well on video classification tasks [4, 12, 13]. However, all these techniques rest on the condition that the video content can be inspected. If the video is encrypted or observed at a network device, we can only measure the sizes of the video frames. In such cases, all these traditional techniques fail because the video content cannot be inspected. Hence, in this work, we explore a time series classification method based on video frame size.

Variable Bit Rate (VBR) Trace. Most popular streaming services use variable bitrate encoding, so the bitrate of an encoded video varies with its content. Variable bitrate encoding is also used in H.264, the most widely used encoding standard for Internet video. In this standard, a video is encoded into a series of consecutive groups of pictures (GOPs). Each GOP contains one I frame and several B and P frames. The I frame (intra-coded picture) is a complete, self-contained image and is independent of the other frame types. Each GOP starts with an I frame. A P frame (predictive coded picture) contains the difference information from the previous I or P frame. B frames (bidirectionally predictive coded pictures) contain difference information from previous and/or subsequent I or P frames. VBR allocates a higher bitrate to more complex segments of a media file and less space to less complex segments. Hence, the frame size changes with the bit rate. In this work, we choose B frame size traces as VBR traces.

The main contributions of this paper are summarized as follows:

  • To the best of our knowledge, this paper is the first to apply the change point method to video frame size traces for scene classification. We propose a new fusion cost function to locate change points more accurately.

  • We extract statistical features from the variable bit rate (VBR) trace, resulting in a larger and more effective feature set.

  • We verify the effectiveness of the extracted features on our own data set, and further explore the impact of different parameters on classification.

The remainder of the paper is structured as follows. First, we outline the related work in Sect. 2. Then, we present the framework of this work in Sect. 3. In Sect. 4, we describe the feature extraction methods in detail. Implementation details and experimental results are described in Sect. 5. Conclusions and future work are provided in Sect. 6.

2 Related Work

In general, video classification and matching are mainly studied in two fields: computer vision and network security.

In the field of computer vision, there are many studies on video classification [4, 12, 13], mostly based on deep learning methods and largely focused on recognizing various actions. There are also explorations of video classification using zero-shot learning [1, 2, 10, 24].

In the field of cybersecurity, R. Schuster et al. [22] showed that, due to the segmentation prescribed by the MPEG-DASH standard, many video streams are uniquely characterized by their burst patterns. R. Dubin et al. [6] and J. Gu et al. [9] also exploited burst patterns to identify encrypted video streams. For the same reason, H. Li et al. [16] and X. Liu et al. [18] explored action recognition on surveillance traffic.

Last but not least, F. H. P. Fitzek et al. [8] presented a publicly available library of frame size traces of long MPEG-4 and H.263 encoded videos, together with a thorough statistical analysis of the traces. Q. Liang et al. [17] used fuzzy techniques to model and classify MPEG VBR videos.

Inspired by these efforts, we explore a novel method to classify H.264-encoded videos. We extract features of VBR traces in the following sections, and further present the effect of different parameters and feature combinations on the results.

3 The Framework

Fig. 1. Framework with four steps

3.1 Video Pre-processing

We first convert the videos with different formats to a single format using Axiom [19]. The target parameters are shown in Table 1.

Table 1. Format parameter

Then each movie is split into multiple segments with a fixed length of 120 s. We then pick out all segments containing action scenes to build the action movie set; the remaining segments form the other movie set.

3.2 VBR Traces Building

The VBR data are extracted from the movie sets using FFmpeg, as the VBR trace is the most effective way to obtain video information without inspecting the video content. B frames are the dominant frame type in video data, and most inter-frame differences are contained in B frames; therefore we only use the VBR trace of B frames. Each VBR trace is an array \(D_i = (t_1, t_2, \dots ,t_n) \), and all labeled traces form the set \(D = \{D_1, D_2,\dots ,D_m\}\). Note that the traces differ in length because the number of frames per segment is not equal.
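As a rough sketch of this step, the B-frame size trace of one segment could be collected with FFmpeg's ffprobe (assumed to be on the PATH; the JSON field names follow ffprobe's frame output):

```python
import json
import subprocess

def b_frame_trace(video_path):
    """Return the B-frame size trace (bytes) of one 120 s segment.

    Calls ffprobe to list each video frame's picture type and packet
    size, then keeps only the B frames.
    """
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "frame=pict_type,pkt_size",
         "-of", "json", video_path],
        capture_output=True, text=True, check=True,
    ).stdout
    frames = json.loads(out).get("frames", [])
    return [int(f["pkt_size"]) for f in frames if f.get("pict_type") == "B"]
```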

3.3 Feature Extraction

Due to the high dimensionality of the raw data and the varying lengths of the VBR traces, it is necessary to extract high-level semantic features to reduce computational complexity and achieve high classification performance. Inspired by [18], a basic idea is to calculate the trace rate change C for each row \(D_i = (t_1, t_2, t_3, \dots ,t_n)\), where C is defined in Eq. 1.

Since each VBR trace is essentially a time series, we slide a window over each trace to extract windowed features, and then compute statistics from each window as additional features. The detailed techniques are introduced in the next section.

3.4 Classification

In the final step we utilize machine learning algorithms to discriminate the action movies from the other movies. To validate the effectiveness of the VBR trace and its features, we carry out empirical studies using six well-known classic machine learning algorithms: Random Forests (RF), Logistic Regression (LR), Gaussian Naive Bayes (GNB), Decision Tree (DT), linear Support Vector Classification (LinearSVC), and SVM with an RBF kernel (SVM-rbf).
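A plausible instantiation of these six classifiers with scikit-learn is sketched below; apart from the 600-tree forest mentioned in Sect. 5.4, the hyperparameters are library defaults rather than values reported in the paper:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# The six classifiers used in the empirical study; only n_estimators=600
# is taken from the paper (Sect. 5.4), everything else is a default.
models = {
    "RF": RandomForestClassifier(n_estimators=600),
    "LR": LogisticRegression(max_iter=1000),
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(),
    "LinearSVC": LinearSVC(),
    "SVM-rbf": SVC(kernel="rbf"),
}
```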

4 Feature Extraction

In this section, two novel feature extraction methods will be illustrated in detail. The first one is inspired by the method which is proposed in [18]. The second one is based on the change point detection techniques in statistics [14, 16]. Features we extracted are shown in Table 2.

Table 2. Features

In order to capture the information of VBR rate change, we first calculate the VBR trace rate change \(C_i = (a_1, a_2, a_3,\dots ,a_{n-1})\) of the \(i\)-th VBR trace \(D_i = (t_1, t_2,\dots ,t_n) \), where \(a_j\) denotes the difference in frame size between the \((j+1)\)-th and the \(j\)-th frame:

$$\begin{aligned} a_j = t_{j+1} - t_j, \qquad j \in [1,n-1] \end{aligned}$$
(1)

The mean values of the VBR trace and the frame size rate change: These features can show the intensity of scene changes in the videos. Given a VBR trace \(D_i = (t_1, t_2, t_3, \dots ,t_n) \) and \(C_i = (a_1, a_2, a_3,\dots ,a_{n-1})\), the mean values \(\overline{t}\) and \(\overline{a}\) are defined as:

$$\begin{aligned} \overline{t} = \frac{1}{n}\sum _{i=1}^n t_i, \qquad \overline{a} = \frac{1}{n-1}\sum _{i=1}^{n-1} a_i \end{aligned}$$
(2)

The variances of the VBR trace and the frame size rate change: These features can show the complexity of scene changes. Given a frame size trace \(D_i = (t_1, t_2, t_3, \dots ,t_n) \) and \(C_i = (a_1, a_2, a_3,\dots ,a_{n-1})\), the variances \(t^{var}\) and \(a^{var}\) are defined as:

$$\begin{aligned} \begin{aligned} t^{var} = \frac{1}{n}\sum _{i = 1}^n (t_i - \overline{t})^2 \\ a^{var} = \frac{1}{n-1} \sum _{i=1}^{n-1}(a_i - \overline{a}) ^ 2 \end{aligned} \end{aligned}$$
(3)

The skewness of the VBR trace and the frame size rate change: These features describe the symmetry of the data distribution. Given a frame size trace \(D_i = (t_1, t_2, t_3, \dots ,t_n) \) and \(C_i = (a_1, a_2, a_3,\dots ,a_{n-1})\), the skewness values \(t^{sk}\) and \(a^{sk}\) are defined as:

$$\begin{aligned} \begin{aligned} t^{sk} = \frac{ \frac{1}{n} \sum _{i = 1}^n (t_i - \overline{t})^3}{(\frac{1}{n-1} \sum _{i = 1}^n (t_i - \overline{t})^2)^{\frac{3}{2}}} \\ a^{sk} = \frac{ \frac{1}{n-1} \sum _{i = 1}^{n-1} (a_i - \overline{a})^3}{(\frac{1}{n-2} \sum _{i = 1}^{n-1} (a_i - \overline{a})^2)^{\frac{3}{2}}} \end{aligned} \end{aligned}$$
(4)

The kurtosis of the VBR trace and the frame size rate change: These features describe the shape of the distribution of the original data. Kurtosis measures whether the distribution is peaked or flat relative to a normal distribution. Given a frame size trace \(D_i = (t_1, t_2, t_3, \dots ,t_n) \) and \(C_i = (a_1, a_2, a_3,\dots ,a_{n-1})\), the kurtosis values \(t^{ku}\) and \(a^{ku}\) are defined as:

$$\begin{aligned} \begin{aligned} t^{ku} = \frac{ \frac{1}{n} \sum _{i=1}^n (t_i - \overline{t})^4}{ (\frac{1}{n} \sum _{i=1}^n (t_i - \overline{t})^2)^2} - 3\\ a^{ku} = \frac{ \frac{1}{n-1} \sum _{i=1}^{n-1} (a_i - \overline{a})^4}{ (\frac{1}{n-1} \sum _{i=1}^{n-1} (a_i - \overline{a})^2)^2} - 3 \end{aligned} \end{aligned}$$
(5)

Amplitude and phase from the DFT: Following [18], we use the DFT to obtain coefficients containing frequency information, applying the DFT directly within each sliding window. Given a frame size trace \(D_i = (t_1, t_2, t_3, \dots ,t_n) \), the coefficients are complex numbers of the form \(z = a + bi\). Amplitude and phase have been proven effective in practice [18], so we use them as features. They are defined as:

$$\begin{aligned} Amplitude = \sqrt{a^2 + b^2}, \qquad Phase = \arctan \frac{b}{a} \end{aligned}$$
(6)
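As an illustrative sketch, the amplitude and phase features of Eq. 6 can be computed from a windowed trace with NumPy; keeping only the first k coefficients is our own assumption, since the paper does not state how many are retained:

```python
import numpy as np

def dft_features(window, k=4):
    """Amplitude and phase of the first k DFT coefficients (Eq. 6).
    k is an illustrative choice, not a value from the paper."""
    z = np.fft.rfft(np.asarray(window, dtype=float))[:k]
    amplitude = np.abs(z)   # sqrt(a^2 + b^2)
    phase = np.angle(z)     # atan2(b, a), a robust form of arctan(b/a)
    return np.concatenate([amplitude, phase])
```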

Sliding window: For each frame size trace \(D_i = (t_1, t_2, t_3, \dots ,t_n)\), a sliding window can capture more detailed information. We do not fix the size of the windows; instead, the number of windows m is fixed. The impact of the parameter m is explored in the empirical studies. Given a frame size trace \(D_i = (t_1, t_2, t_3, \dots ,t_n)\), the result is \(D_i = (d_1,d_2,\dots ,d_j,\dots ,d_m)\), where each sub-sequence \(d_j\) is defined as:

$$\begin{aligned} d_j = {\left\{ \begin{array}{ll} \{ t_{[(j-1)\frac{n}{m}]},\dots , t_{j\frac{n}{m}}\}, \qquad 1 \le j \le (m-1)\\ \{ t_{[(j-1)\frac{n}{m}]},\dots , t_n \}, \qquad j = m \end{array}\right. } \end{aligned}$$
(7)
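A minimal sketch of the windowed statistical features of Eqs. 1-5, using np.array_split as an approximation of the index scheme in Eq. 7 (the feature ordering is our own choice, and scipy's moment conventions differ slightly from Eqs. 4-5):

```python
import numpy as np
from scipy.stats import kurtosis, skew

def windowed_stats(trace, m):
    """Split a trace into m near-equal windows and compute the mean,
    variance, skewness and kurtosis (Eqs. 2-5) of each window and of
    its rate change (Eq. 1)."""
    feats = []
    for d in np.array_split(np.asarray(trace, dtype=float), m):
        a = np.diff(d)  # rate change, Eq. (1)
        for x in (d, a):
            # kurtosis() returns excess kurtosis, matching the "-3" in Eq. (5)
            feats += [x.mean(), x.var(), skew(x), kurtosis(x)]
    return np.array(feats)
```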

Since the length of a scene varies within each video, a fixed sliding-window length is not reasonable. Ideally, a single window corresponds to a single scene. Therefore, we first apply the PELT algorithm [14] to detect change points, which yields more reasonable window lengths.

Change point detection: for a time series \(y_{1:n}\), assume we obtain m change points with positions \(\tau = \{\tau _{1}, \tau _{2}, \dots ,\tau _{m}\}\) in \(y_{1:n}\). Let \(\tau _{0} = 0\) and \(\tau _{m+1} = n\). A commonly used method to identify multiple change points is to minimize

$$\begin{aligned} \sum _{i=1}^{m+1}[\mathcal {C}(y_{(\tau _{i-1}+1):\tau _{i}})] + \beta f(m) \end{aligned}$$
(8)

Here \(\mathcal {C}\) is a cost function for a segment and \(\beta f(m)\) is a penalty that guards against overfitting.

We present a weighted fusion cost function that combines the cost function \(\mathcal {C}_{1}\) for an exponential distribution with changing mean and the cost function \(\mathcal {C}_{2}\) for a normal distribution with variable variance. More formally, for a segment of \(y\) between \(\tau _{i-1} + 1\) and \(\tau _{i}\), of length \( n = \tau _{i} - \tau _{i-1}\), we have

$$\begin{aligned} \mathcal {C}_{1} = -n(\log (\sum _{j = 1}^{n} y_{j}) ) \end{aligned}$$
(9)
$$\begin{aligned} \mathcal {C}_{2} = n \log \sigma ^{2} + \frac{1}{\sigma ^{2}} \sum _{j = 1}^{n}(y_{j}-\mu )^{2} \end{aligned}$$
(10)

Here,

$$\begin{aligned} \sigma ^2 = \frac{1}{n} \sum _{j=1}^{n}(y_{j} - \bar{y})^{2} \end{aligned}$$
(11)
$$\begin{aligned} \mu = \frac{1}{n} \sum _{j = 1}^{n} y_{j} \end{aligned}$$
(12)

Figure 2 shows a case study comparing the methods using \(\mathcal {C}_{1}\), \(\mathcal {C}_{2}\), and the combined cost function. As shown in Fig. 2, the change points obtained with cost function \(\mathcal {C}_2\) are relatively dense, while those obtained with cost function \(\mathcal {C}_1\) are relatively sparse. Hence, we use the fused cost function \(\mathcal {C} = \theta _{1} \mathcal {C}_1 + \theta _{2} \mathcal {C}_{2}\). In our study, we set the parameters to \(\theta _{1} = 0.7, \theta _{2} = 0.3\); we also carry out empirical studies to explore the impact of these parameters.

Another problem is that some neighbouring change points are too close together to support reasonable segments. For the times \(T_{\tau _{i}}\) and \(T_{\tau _{i-1}}\) of two successive change points, if \(T_{\tau _{i}} - T_{\tau _{i-1}} < 0.5\) s, we ignore \(\tau _{i}\).
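A sketch of this detection step using the ruptures library is given below; the penalty value, the minimum segment size, and the frame rate used to convert index gaps into seconds are all illustrative assumptions, not values from the paper:

```python
import numpy as np
import ruptures as rpt
from ruptures.base import BaseCost

class FusedCost(BaseCost):
    """Weighted fusion of the exponential-mean cost C1 (Eq. 9) and the
    normal variable-variance cost C2 (Eq. 10)."""
    model = "fused"
    min_size = 2

    def __init__(self, theta1=0.7, theta2=0.3):
        self.theta1, self.theta2 = theta1, theta2

    def fit(self, signal):
        self.signal = np.asarray(signal, dtype=float).ravel()
        return self

    def error(self, start, end):
        seg = self.signal[start:end]
        n = len(seg)
        c1 = -n * np.log(seg.sum())  # Eq. (9)
        var = seg.var() + 1e-8       # guard against log(0)
        c2 = n * np.log(var) + ((seg - seg.mean()) ** 2).sum() / var  # Eq. (10)
        return self.theta1 * c1 + self.theta2 * c2

def detect_change_points(trace, fps=24.0, min_gap_s=0.5, pen=10.0):
    """PELT with the fused cost, then drop any change point closer than
    min_gap_s to its predecessor. Treating consecutive B frames as
    uniformly spaced at fps is only an approximation."""
    algo = rpt.Pelt(custom_cost=FusedCost(), min_size=5)
    algo.fit(np.asarray(trace, dtype=float))
    kept, last = [], -np.inf
    for cp in algo.predict(pen=pen):
        if (cp - last) / fps >= min_gap_s:
            kept.append(cp)
            last = cp
    return kept
```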

Fig. 2. The change points obtained by different cost functions. The first and second rows show the results for the normal and exponential distributions, respectively; the last row shows the result of fusing the two functions.

5 Evaluation

5.1 Data Collection

Table 3. Collection of videos

The collection of videos is shown in Table 3. We first convert the videos with the parameters in Table 1 using Axiom [19]. Each movie is then split into multiple segments with a fixed length of 120 s, and B frame traces are extracted from each segment. In total, we obtained 2144 samples, including 379 action scene samples and 1765 other scene samples. Since the data set is clearly imbalanced, we over-sample the minority class with the Synthetic Minority Oversampling Technique (SMOTE) [5]. We use cross validation to evaluate the performance of the extracted features, splitting the samples randomly into 5 folds.
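A sketch of this evaluation setup with imbalanced-learn and scikit-learn is shown below; the file names are hypothetical placeholders, and placing SMOTE inside the pipeline ensures it is fit only on the training folds:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X: feature matrix, y: binary labels (action vs. other), assumed to be
# produced by the feature extraction of Sect. 4 (file names hypothetical).
X, y = np.load("features.npy"), np.load("labels.npy")

# SMOTE is re-fit per training fold inside the pipeline, so synthetic
# samples never leak into the validation folds.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=600, random_state=0)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
print(scores.mean())
```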

5.2 Experimental Setup

We use six models to evaluate the effectiveness of the features. First, we test the features extracted with fixed windows, as shown in Table 4.

This lets us evaluate the effect of different numbers of windows. Second, for the features extracted by the change point method, we choose the best model from the previous step to evaluate their performance. We arrange the features between change points \(\tau _{i}\) and \(\tau _{i+1}\) as \((time \ span, mean, variance, skewness, kurtosis)\) and define them as the features of an interval, called span features. Different combinations of features are then tested according to Table 5. We also use Wilcoxon's signed-rank test to assess the differences between feature sets.

Table 4. Combinations of features
Table 5. Different combinations of features

5.3 Performance Metrics

We use G-mean, accuracy, and F1-score to evaluate the performance of our features.

G-mean. We define g-mean as

$$\begin{aligned} G-mean = \sqrt{\frac{TP}{TP+FN} \times \frac{TN}{TN+FP}} \end{aligned}$$
(13)

where TP and FP denote the true positives and false positives, respectively, and TN and FN denote the true negatives and false negatives, respectively.

Accuracy. We define accuracy as

$$\begin{aligned} accuracy = \frac{TP + TN}{TP + TN + FN + FP} \end{aligned}$$
(14)

F1-score. We define F1-score as

$$\begin{aligned} f1-score = \frac{2TP}{2TP+FP+FN} \end{aligned}$$
(15)
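These metrics can be sketched with scikit-learn as follows; g_mean is implemented directly from Eq. 13, while accuracy and F1-score map onto library functions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def g_mean(y_true, y_pred):
    """G-mean of Eq. (13): geometric mean of the recall on the
    positive class and the recall on the negative class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return np.sqrt(tp / (tp + fn) * tn / (tn + fp))

# accuracy (Eq. 14) and F1-score (Eq. 15) correspond directly to
# accuracy_score(y_true, y_pred) and f1_score(y_true, y_pred).
```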
Fig. 3. The impact of different features on different models. (a) Performance of the features on six models with the number of windows \(m = 1\); (b) performance of the features obtained with different numbers of windows on the random forest; (c) performance of Feature V1 on six models; (d) performance of combinations of features from the two methods. Here, \(0.7\_0.3\_0.5\_V3\) means \(\theta _1 = 0.7\) and \(\theta _2 = 0.3\) with a span of 0.5 s, so some change points are ignored as mentioned above.

5.4 Experimental Results

In this part, we carry out experiments to evaluate the effectiveness of the features. As shown in Fig. 3(a), when the number of windows m is 1, random forest performs best. The SVM with an RBF kernel outperforms the one with a linear kernel, indicating that our data requires a nonlinear method to fit. The random forest with 600 trees has the strongest fitting capacity.

As shown in Fig. 3(b), we experiment with features obtained using different numbers of windows under the random forest model. As the number of windows m increases, the g-mean and f1-score decrease. The best g-mean score, that of feature \(1windows\_DFT\), is 5.41% higher than that of feature \(9windows\_DFT\). We conjecture that fewer windows imply larger window sizes, which capture more macro-level fluctuation characteristics. Another observation is that there is no significant difference between features with and without the Fourier transform. In theory, we believe that selecting appropriate frequency domain characteristics should better reflect the fluctuation characteristics of the data; the reason for this result may be that our method of selecting frequency domain features is not suitable.

As shown in Fig. 3(c) and (d), for the features extracted by the change point method, random forest again performs best, with decision trees and Gaussian Naive Bayes also performing well. In the random forest experiment, there are no obvious differences between the three feature combinations based on the change point method. Moreover, compared with the features extracted with only one window, the change point method does not show obvious advantages. In theory, we believe the change point method is more reasonable, and the problem may lie in our treatment of the features. Since the change points of each sample differ, the number of extracted features also differs. To obtain training samples of equal length, we zero-pad the features, which causes substantial information redundancy, so the classifier does not achieve good results.

We conduct experiments on the change point data obtained with different combinations of \(\theta _1\) and \(\theta _2\). As shown in Fig. 4 and Fig. 5, the best result is achieved with \(\theta _1 = 0.4\) and \(\theta _2 = 0.6\). In terms of f1-score and accuracy, feature \(0.4\_0.6\_V2\) scores slightly higher than feature 1window.

Fig. 4. Results for various selections of \(\theta _1\) and \(\theta _2\). For example, \(0.1\_0.9\_V2\) means \(\theta _1 = 0.1\) and \(\theta _2 = 0.9\). Feature V2 is used for testing.

Fig. 5. Comparison between Feature 1window and Feature \(0.4\_0.6\_V2\)

In addition, we compared the differences between feature combinations using Wilcoxon's signed-rank test. First, we test the best features of the two approaches against each other; second, we test the different combinations of features within each approach. The results are shown in Table 6.

Table 6. The results of Wilcoxon's signed-rank test

The results of the hypothesis tests show that there is no significant difference in performance between most of the feature sets in our experiments. However, for the best model in our experiments, random forest, the differences between features remain evident.

6 Conclusion and Future Research

In this paper, we explored a novel method to classify videos. The VBR trace exhibits semantic features that are useful for classification. Further, we explored more effective features obtained by segmenting the data with the change point method. Overall, our features are effective for the binary classification of videos, although the hypothesis test results did not fully match our expectations. Several issues remain open: the zero-padding of change point features needs a better solution; a faster and more effective change point algorithm could be used; and the fusion of the change point cost functions and their weights leaves considerable room for tuning. Finally, in the future, we plan to attempt multi-class classification tasks.