1 Introduction

Video-based action recognition has been widely investigated in the past decade, owing to its widespread applications in video surveillance, human action analysis, human-computer interaction, and so on. Traditional methods for action recognition in videos are mainly based on hand-crafted features, such as 3D-Hessian, 3D-Harris, and improved dense trajectory (IDT) [2, 33, 59]. However, these methods are limited, to a large extent, by the sampled interest regions when extracting video features. Motivated by the great promise that deep learning has shown in image understanding, object detection, and target tracking, deep learning methods, such as two-stream CNNs and 3D CNNs, have been applied to the problem of action recognition [8, 9, 11, 19, 25, 48]. Unlike traditional methods, deep learning methods are end-to-end frameworks that automatically learn and extract video features from whole images, showing great potential for video-based action recognition.

Two-stream CNN [37] has been proven to be a successful architecture, which contains a spatial stream and a temporal stream to extract appearance and motion features from videos separately. Two-stream CNN and most of its variants [10, 12, 19, 37, 49, 57] mainly focus on how to effectively fuse these two streams or how to perform segment-based sampling and effectively aggregate the different segment features, for example, the spatiotemporal distilled dense-connectivity network (STDDCN) [12] and the temporal segment network (TSN) [49]. 3D CNN has also been a promising method for action recognition, which extends 2D convolution with the temporal domain [5, 7, 13, 42, 56]. The Convolutional 3D (C3D) [42], Inflated 3D ConvNet (I3D) [5], and Temporal 3D ConvNet (T3D) [7] have achieved satisfactory results on both UCF101 and HMDB51. However, most of these studies have not focused on how to selectively enhance informative features. In recent years, the attention mechanism has been widely exploited in many fields, such as natural language processing (NLP) [41] and image captioning [53], to help generate a more effective attention mask for the task at hand. We argue that the relationships among feature channels play an important role and that selectively emphasizing informative features can help the two-stream CNN yield superior performance for action recognition. Thus, we introduce an attention mechanism into the two-stream CNN to generate a new weight distribution over features in both the spatial and temporal streams to better model human actions [1, 26, 43]. The idea is illustrated in Fig. 1.

Fig. 1 a The attention mask generated by the network without the attention unit. b The attention mask generated by the network with the attention unit. c The channel-wise attention unit is embedded into the basic two-stream architecture, which can enhance informative features for action recognition

In this paper, a model named spatial-temporal channel-wise attention network (STCAN) is proposed based on TSN to recognize human action. Specifically, STCAN includes a flexible and effective module dubbed channel-wise attention unit (CAU), which can be embedded into CNNs conveniently. Inspired by [17, 26], to selectively highlight the informative features while preserving the integrity of the original features, the CAU consists of two parts: a squeeze-and-excitation (SE) operation module and a shortcut connection module. The former captures the channel-wise dependencies and generates a new weight distribution for the feature channels through a feature compression operation, which can selectively enhance informative features and suppress less relevant ones. Considering that the former module may destroy the discriminative properties of the original features, the latter is designed as a bypass structure connecting the input and output of the CAU. Notably, the CAU is simple and effective, and adds only a slight computational overhead to the network. Furthermore, we argue that the network performance may depend strongly on the embedding position and the number of CAUs, so it is crucial to explore the most suitable position and the most appropriate number of CAUs for superior performance.

Our main contributions are summarized as follows:

  • A model named STCAN is proposed, which has the ability to model long-range temporal structure. STCAN includes a simple but effective module named CAU, which can not only generate a channel-wise weight distribution to selectively enhance relevant features, but also maintain the discriminative properties of the original features.

  • The proposed CAU can be embedded into CNNs conveniently with only a slight increase in computational complexity, and it allows end-to-end training. We explore several strategies for embedding the CAU to obtain superior performance.

  • Compared with state-of-the-art approaches, STCAN achieves superior performance, as evaluated on two challenging action recognition datasets, UCF101 and HMDB51.

The remainder of this paper is organized as follows. Section 2 discusses related work, and Section 3 describes the proposed STCAN architecture in detail. In Section 4, the experimental details and results are provided. Section 5 gives some concluding remarks.

2 Review of related works

In this section, we review work closely related to this paper, covering human action recognition and the attention mechanism.

2.1 Action recognition

Action recognition in videos has made significant progress in the past decade. State-of-the-art approaches can be roughly classified into two types: RGB-based ones and RGB-D-based ones.

RGB-based approaches can effectively extract appearance information and better represent action details, and they can be categorized into two types: those using hand-crafted features and those based on deeply-learned features. For the former, many spatial-temporal feature detectors have been proposed, for example, 3D-Hessian [52], 3D-Harris [21], and improved dense trajectory (IDT) [46]. Additionally, to extract the appearance and dynamic information around interest regions, histogram descriptors have been developed, for example, Histogram of Oriented Optical Flow (HOF) [22], Histogram of Oriented Gradient (HOG) [22], Extended Speeded Up Robust Features (ESURF) [52], and Motion Boundary Histogram (MBH) [44]. Moreover, to form the feature representations, several encoding methods have been applied. The classical encoding methods include Fisher Vector (FV), Bag of Visual Words (BoVW) [34], Vector of Locally Aggregated Descriptors (VLAD), and Multi-View Super Vector (MVSV) [4]. These traditional methods are simple to implement, but their ability to represent video features is limited by the sampled interest regions.

Unlike traditional methods, deep learning does not sample interest regions manually; instead, it extracts features autonomously through network training [3, 14]. There are two main research lines for deep-learning-based action recognition: one uses 3D CNNs to recognize human action, and the other is based on optical flow [5, 7, 13, 15, 30, 42, 56]. 3D CNN extends 2D convolution with the temporal domain, which takes temporal information into consideration [5, 7, 13, 15, 42]. Carreira et al. [5] proposed two-stream I3D to learn seamless spatio-temporal feature extractors. Diba et al. [7] proposed T3D, which introduces a temporal layer to efficiently model temporal convolution kernel depths. He et al. [15] proposed a spatial-temporal network architecture to model both local and global spatial-temporal features. One successful architecture based on optical flow is the two-stream CNN [10, 12, 33, 37, 49, 57], which uses optical flow images and RGB images to extract motion and appearance features in parallel. Ng et al. [33] introduced the long short-term memory (LSTM) module into the two-stream CNN so that motion information could be extracted more accurately from the images. Wang et al. [49] proposed TSN, which extracts video features on the basis of the two-stream CNN and improves accuracy dramatically. Feichtenhofer et al. [10] proposed a variety of methods to fuse the temporal and spatial networks, which further improved network performance.

RGB-D-based approaches are not sensitive to illumination changes or dynamic camera views, and can accurately estimate the contour and skeleton of the human body. The most typical ones are skeleton-based methods [35, 58]. Plizzari et al. [35] proposed a spatial-temporal transformer network to model dependencies between the joints. Zheng et al. [58] constructed a spatial and temporal graph convolution network to extract spatial-temporal features of the skeleton for classification. However, most of these approaches lack the ability to distinguish the contributions of different image features.

Taking inspiration from the TSN architecture, we build a new model STCAN. In order to capture the channel-wise dependencies and enhance important features, we further introduce the attention mechanism.

2.2 Attention mechanism

Originating from human vision, the attention mechanism can help CNNs focus on specific image features and suppress less useful ones [26]. We argue that, through the attention mechanism, CNNs can generate a weight distribution over image features, which can be applied to the original image to highlight the informative features.

Recently, the attention mechanism has been extensively studied in many domains, such as human parsing, object tracking, image cropping, and image captioning [1, 17, 20, 23, 24, 26, 36, 41, 43, 47, 50, 51, 53, 55], and has also been applied in many practical applications [27,28,29]. For example, Tan et al. [41] designed an effective attention architecture for semantic role labeling. You et al. [53] combined bottom-up and top-down approaches through a semantic attention model to selectively focus on semantic concept proposals. Liao et al. [26] proposed a residual attention unit for 3D CNNs to highlight the foreground region for action recognition. Hu et al. [17] proposed a squeeze-and-excitation block to model channel-wise interdependencies to improve network performance. Li et al. [24] proposed a spatiotemporal attention module to learn discriminative feature representations. Shen et al. [36] designed an effective hierarchical attention Siamese network for object tracking. Zhang et al. [55] proposed a moving foreground attention model to pay more attention to foreground targets. In this paper, we introduce an attention mechanism into the two-stream CNN and design an attention module, the CAU. Unlike previous works, the CAU can not only emphasize informative features but also retain the discriminative properties of the original features.

In summary, inspired by the TSN architecture and the attention mechanism, we propose a new method, STCAN, which can model long-term temporal structure and includes an attention unit, the CAU. The CAU, containing an SE operation module and a shortcut connection module, is able to model the interdependencies among all channels, selectively highlight the informative features, and retain the discriminative properties of the original features.

3 Spatial-temporal channel-wise attention network

In this section, we describe the proposed STCAN in detail. First, the STCAN framework is introduced in Section 3.1. Then, the architecture of the CAU is presented in Section 3.2. Finally, the strategies for embedding the CAU are discussed in Section 3.3.

3.1 Spatial-temporal channel-wise attention network framework

The STCAN architecture is built on the basis of TSN [49]. Similarly, STCAN has the capacity to incorporate long-term temporal information into the learning of action models and learns them efficiently with a segment-based sampling and aggregation scheme. Specifically, given a video V, STCAN first splits the video into K segments \(\{S_{1}, S_{2}, \dots , S_{K}\}\) of equal length, from which the snippets \(\{I_{1}, I_{2}, \dots , I_{K}\}\) are randomly sampled. Each snippet \(I_{i}\) consists of 1 frame for RGB or 5 frames for optical flow. Then, the two types of frames are fed into the networks to extract features and make clip-level recognition predictions. Finally, the video-level prediction is obtained by fusing these clip-level recognition predictions. This process can be formulated as follows:

$$ Net(I_{1}, I_{2}, \dots, I_{K}) = \mathscr{H}\left(\mathscr{G}\left(\mathscr{F}(I_{1};W), \mathscr{F}(I_{2};W), \dots, \mathscr{F}(I_{K};W)\right)\right), $$
(1)

where \({\mathscr{F}}(\cdot ;W)\) denotes the ConvNet function with parameters W, which takes \(I_{i}\) as input and produces scores over all the classes. \({\mathscr{G}}(\cdot )\) refers to the segmental consensus function that combines the outputs of multiple snippets. \({\mathscr{H}}(\cdot )\) refers to the Softmax function, which predicts the probability of each class for the whole video.
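To make Eq. (1) concrete, the following is a minimal sketch of the segmental consensus pipeline in PyTorch-style Python. The function and variable names (video_prediction, convnet, snippets) are illustrative placeholders rather than part of the original implementation, and average pooling is used as the consensus function \(\mathscr{G}\), as described later in Section 4.1.2.

```python
import torch
import torch.nn.functional as F

def video_prediction(snippets, convnet):
    """Illustrative sketch of Eq. (1): per-snippet scores F(I_k; W),
    segmental consensus G (average pooling), and Softmax H for the
    video-level prediction.

    `snippets` is a tensor of shape (K, C, H, W); `convnet` is any
    backbone returning class scores of shape (K, num_classes)."""
    snippet_scores = convnet(snippets)        # F(I_k; W) for k = 1..K
    consensus = snippet_scores.mean(dim=0)    # G: average over the K segments
    return F.softmax(consensus, dim=0)        # H: class probabilities for the video
```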

Algorithm 1 The learning procedure of STCAN

The model STCAN contains a simple but effective module, the CAU, in both the spatial and temporal streams, which can selectively highlight the informative features and preserve the discriminative properties of the original features. Additionally, deep CNN architectures (e.g. BN-Inception and Inception-v3 [31, 40]) are employed as the backbone. The ConvNet parameters are transferred from pre-trained models, and the learning procedure is sketched in Algorithm 1. Figure 2 illustrates the basic framework of the model STCAN.

Fig. 2 The basic framework of the Spatial-Temporal Channel-wise Attention Network. The sample video is divided into K segments (K = 3 in this example), from each of which a short snippet (1 frame for RGB or 5 frames for optical flow) is randomly sampled. The CAU is embedded into both streams to selectively highlight the informative features. The snippet scores are fused to yield a video-level prediction

3.2 Architecture of the channel-wise attention unit

Figure 3 shows the architecture of the CAU. Inspired by [17, 26], the CAU contains two parts: the shortcut connection module and the SE operation module. The former, connecting the input and output of the CAU, prevents the loss of discriminative properties of the original features. The latter makes full use of the global channel information to obtain the channel-wise dependencies and adaptively recalibrate the feature responses, operating in a squeeze-and-excitation manner.

Fig. 3 The architecture of the Channel-wise Attention Unit

The squeeze and excitation operations mentioned above are implemented by five layers, of which the first performs the squeeze operation and the remaining layers perform the excitation operation. In this paper, global average pooling, a simple aggregation technique, is used as the first layer. The remaining four layers consist of two convolutional layers (namely Conv1 and Conv2), a ReLU layer, and a Sigmoid layer. The two convolutional layers fuse the features of each channel, and the ReLU layer makes the module capable of learning the non-linearity between channels. In addition, to ensure that multiple channels can be emphasized, we opt for a Sigmoid layer to learn non-mutually exclusive relationships.

Consider the input \(x\in \mathbb {R}^{N\times C\times W\times H}\), where N refers to the number of training samples in each batch, C refers to the number of channels, and W and H represent the width and the height of the sampled image. Then, we can obtain the channel statistic \(z(x)\in \mathbb {R}^{N\times C\times 1\times 1}\) by applying global average pooling to each feature channel of each training sample. The c-th channel of the n-th sample of z(x), denoted as \(z_{cn}(x)\in \mathbb {R}^{1\times 1}\), can be formulated as follows:

$$ z_{cn}(x)=\frac{1}{W\times H}\sum\limits_{i=1}^{W}\sum\limits_{j=1}^{H}x_{cn}(i,j), $$
(2)

where \(x_{cn}(i,j)\) is the value at position (i,j) of the c-th channel of the n-th sample of the input x. Furthermore, the output of the excitation operation, \(A(x)\in \mathbb {R}^{N\times C\times W\times H}\), can be written as follows:

$$ A(x)=\sigma (\mathscr{F}_{2}(\delta (\mathscr{F}_{1}(z(x);W_{1}));W_{2})), $$
(3)

where σ refers to the Sigmoid function, \({\mathscr{F}}(\cdot ;W)\) denotes a convolutional network function with parameters W, and δ refers to the ReLU function. More specifically, \({\mathscr{F}}_{1}(\cdot ;W_{1})\) denotes the Conv1 function, whose output lies in \(\mathbb {R}^{N\times \frac {C}{r}\times 1\times 1}\), and \({\mathscr{F}}_{2}(\cdot ;W_{2})\) denotes the Conv2 function, whose output lies in \(\mathbb {R}^{N\times C\times 1\times 1}\). The Conv1 layer is a dimensionality-reduction layer whose reduction ratio r is set to 16. The Conv2 layer is a dimensionality-increasing layer, which restores the output dimension to the channel dimension of the input x. Note that A(x) assigns a scalar weight to each channel feature map of the input x, and these weights are learned through the two convolutional layers and the ReLU layer.

Thus, the final output of the CAU, \(H(x)\in \mathbb {R}^{N\times C\times W\times H}\), can be represented as follows:

$$ H(x)=x\oplus (A(x)\odot x), $$
(4)

where ⊕ denotes element-wise addition, corresponding to the shortcut connection, and ⊙ denotes element-wise multiplication, which applies a weight coefficient to the feature maps of each channel. For further explanation, an overview of the squeeze and excitation operations is shown in Fig. 4.
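The CAU described by Eqs. (2)-(4) can be summarized by the following PyTorch-style sketch. The class name and the use of 1×1 convolutions for Conv1 and Conv2 are our reading of the description above rather than the authors' exact implementation, which is built on Caffe.

```python
import torch
import torch.nn as nn

class CAU(nn.Module):
    """Sketch of the channel-wise attention unit: squeeze (global average
    pooling), excitation (Conv1 -> ReLU -> Conv2 -> Sigmoid), and the shortcut
    connection H(x) = x + A(x) * x from Eq. (4). Layer names follow the paper;
    implementing Conv1/Conv2 as 1x1 convolutions is our assumption."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)                      # Eq. (2): z(x)
        self.conv1 = nn.Conv2d(channels, channels // reduction, 1)  # dimensionality reduction
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels // reduction, channels, 1)  # back to C channels
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        z = self.squeeze(x)                                         # (N, C, 1, 1)
        a = self.sigmoid(self.conv2(self.relu(self.conv1(z))))      # Eq. (3): A(x)
        return x + a * x                                            # Eq. (4): shortcut + recalibration
```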

Fig. 4 The overview of the squeeze and excitation operations

Remarkably, the purpose of the CAU is twofold: (a) it models the channel-wise interdependencies and adaptively recalibrates feature responses to selectively enhance informative features and suppress less relevant ones; (b) it prevents the loss of discriminative properties of the original features through a bypass structure. The proposed CAU is flexible and can be embedded into CNNs conveniently. Nevertheless, simply applying the CAU may not achieve satisfactory optimization performance, so how to embed the CAU into networks is worth further consideration.

3.3 Embedding strategies

In this subsection, the strategies for embedding the CAU into CNNs are investigated. Since the network performance may depend strongly on the embedding positions and the number of CAUs, we design several strategies to seek the most suitable positions and the most appropriate number of CAUs for superior performance.

As mentioned above, Inception networks, which contain a sequence of convolution Inception blocks, are employed as the network backbone. Taking Inception-v3 as an example, we select the first four convolution Inception blocks (named Block1, Block2, Block3, and Block4) for the experiments and use the Kinetics dataset for pre-training. First, only one CAU is embedded into Inception-v3, following Block1, Block2, Block3, and Block4 individually. For instance, Fig. 5 shows the structure when a CAU is embedded following Block1. Then, multiple CAUs are embedded following the chosen convolution Inception blocks for further experiments. In this way, we can not only find the appropriate positions to embed the CAU, but also determine whether multiple CAUs can further optimize the network performance.
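The sketch below illustrates one possible way to realize this embedding strategy, assuming the CAU module sketched in Section 3.2 and a PyTorch-style backbone whose Inception blocks are exposed as attributes; the helper name embed_cau_after_block, the attribute name block1, and the example channel count are all hypothetical.

```python
import torch.nn as nn
# CAU refers to the module sketched in Section 3.2

def embed_cau_after_block(block, channels):
    """Place a CAU directly after a chosen Inception block (e.g. Block1).
    `block` is the existing backbone block and `channels` is its number of
    output channels; the real insertion depends on the backbone definition."""
    return nn.Sequential(block, CAU(channels, reduction=16))

# e.g. backbone.block1 = embed_cau_after_block(backbone.block1, channels=192)
# (192 is only an example channel count, not taken from the paper)
```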

Fig. 5 The overview when a CAU is embedded following Block1

4 Experiments

In this section, the datasets involved and the implementation details are first introduced. Several good practices for preventing overfitting during training are then investigated. Finally, we carry out experiments to evaluate the effectiveness of STCAN on two challenging datasets.

4.1 Datasets and implementation details

4.1.1 Datasets

Two action recognition datasets are used to verify the effectiveness of the proposed model STCAN, namely HMDB51 and UCF101. HMDB51 [18] consists of 51 action categories and 6766 short video clips with a resolution of 320×240. HMDB51 is divided into three sub-datasets (called split1, split2, and split3), each of which contains 3570 clips for training and 1530 clips for testing. The experimental results are validated on these three sub-datasets, and we take the average of the three results as the final accuracy on HMDB51.

UCF101 [39] consists of 101 action categories and 13320 short video clips with a resolution of 320×240. This dataset is also divided into three sub-datasets. Similarly, the experimental results on UCF101 are verified on the three sub-datasets, and we take the average of the three results as the final accuracy on UCF101.

We downloaded the HMDB51 dataset from https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/ and the UCF101 dataset from https://www.crcv.ucf.edu/data/UCF101/UCF101.rar. To extract the RGB images and optical flow images of all frames of the datasets, we use the OpenCV and DenseFlow toolkits compiled with GPU support. Moreover, the TV-L1 (Total Variation-L1) [54] algorithm is used to calculate the optical flow.
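As a rough illustration of this optical flow extraction step, the snippet below computes TV-L1 flow with OpenCV and rescales it to 8-bit images. It assumes the opencv-contrib-python package (which provides the optflow module) and the common practice of clipping flow values to [-20, 20]; the authors' GPU-compiled DenseFlow pipeline may differ in its details.

```python
import cv2
import numpy as np

def tvl1_flow(prev_gray, next_gray, bound=20):
    """Compute TV-L1 optical flow between two 8-bit grayscale frames and
    return the horizontal and vertical components as 8-bit images."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()        # requires opencv-contrib
    flow = tvl1.calc(prev_gray, next_gray, None)           # (H, W, 2) float flow field
    flow = np.clip(flow, -bound, bound)                    # clip to a fixed range
    flow = ((flow + bound) / (2 * bound) * 255).astype(np.uint8)
    return flow[..., 0], flow[..., 1]                      # x and y components
```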

4.1.2 Implementation details

From the perspective of network inputs, a single RGB frame and 5 stacked consecutive optical flow frames, randomly sampled from each video clip, are treated as the inputs of the spatial and temporal streams, respectively. Figure 6 shows the RGB images and their corresponding optical flow images. To extract the optical flow frames, we use the TV-L1 [54] optical flow algorithm. Specifically, it takes about 18 hours to extract the frame images (RGB images, horizontal and vertical optical flow images) for all videos in UCF101, and about 5 hours for HMDB51.

Fig. 6 The RGB images and their corresponding optical flow images. The first row shows the RGB images, and the second and third rows show the optical flow images in the x and y components

In the training process, four data augmentation methods, namely random cropping, corner cropping, horizontal flipping, and scale jittering [38], are adopted for pre-processing. The crop size for scale jittering is randomly determined according to the jittering ratios, which are set to 1, 0.875, 0.75, and 0.66. Then, the cropped regions are uniformly resized to 224×224 for network training. The stochastic gradient descent (SGD) algorithm is used to learn the network parameters. The batch size is set to 32 and the momentum is set to 0.9. BN-Inception and Inception-v3 are employed as the backbones. In addition, pre-trained models from Kinetics or ImageNet are used to initialize the network weights. Moreover, the number of segments is set to 3. Since average pooling can model multiple segments jointly and capture visual information from the entire video, we use it as the aggregation function to calculate the prediction scores of the segments. The fusion ratio of the spatial and temporal streams is set to 1:1.6.
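For illustration, the following sketch expresses the pre-processing and optimizer settings above in PyTorch/torchvision terms. The original implementation is in Caffe, so these are rough functional equivalents rather than the exact augmentation code, and RandomResizedCrop only approximates the corner-cropping and scale-jittering scheme of [38].

```python
import torch
import torchvision.transforms as T

# Training-time pre-processing: crop with jittered scale, flip, resize to 224x224.
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.66 ** 2, 1.0)),  # rough stand-in for corner cropping + scale jittering
    T.RandomHorizontalFlip(),                          # horizontal flipping
    T.ToTensor(),
])

def make_optimizer(stream_net, lr):
    # SGD with momentum 0.9; lr is 0.001 for the temporal stream and 0.0005 for
    # the spatial stream, and the batch size of 32 is set in the data loader.
    return torch.optim.SGD(stream_net.parameters(), lr=lr, momentum=0.9)
```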

For the temporal stream on UCF101, the total number of iterations is set to 9000. The learning rate is initialized to 0.001 and is reduced by a factor of 10 after 5000 and 8000 iterations. For the spatial stream, the total number of iterations is set to 4000. The learning rate is initialized to 0.0005 and is reduced by a factor of 10 every 1500 iterations. On HMDB51, the overall procedure is similar to that on UCF101. For the temporal stream, the total number of iterations is set to 5000. The learning rate is initialized to 0.001 and is reduced by a factor of 10 after 2500 and 3500 iterations. For the spatial stream, the total number of iterations is set to 3000. The learning rate is initialized to 0.0005 and is reduced by a factor of 10 every 1000 iterations.

In order to prevent overfitting [32], we adopt two methods: gradient clipping and dropout [16]. The gradient clipping parameter (clip_gradient) is set to 20 in the temporal stream and 40 in the spatial stream, which keeps the gradients within a fixed range and prevents gradient explosion during training. Moreover, dropout layers are added between the global pooling layer and the FC layer of the spatial and temporal streams, with the parameters (dropout_ratio) set to 0.7 and 0.8, respectively.
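A minimal PyTorch-style sketch of these two counter-measures is given below; clipped_step is a hypothetical helper, and clip_grad_norm_ is used as the analogue of Caffe's clip_gradient parameter.

```python
import torch
import torch.nn as nn

# Dropout between the global pooling layer and the FC layer of each stream
# (p = 0.7 for the spatial stream, 0.8 for the temporal stream).
dropout = nn.Dropout(p=0.8)

def clipped_step(stream_net, optimizer, loss, clip_value=40):
    # clip_value is 40 for the spatial stream and 20 for the temporal stream.
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(stream_net.parameters(), max_norm=clip_value)
    optimizer.step()
```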

The experiments are carried out on Ubuntu 18.04 with an RTX 2080 Ti GPU, and the proposed model is built on Caffe. Since only a single RTX 2080 Ti is used, network training is time-consuming. For instance, when training the BN-Inception network on UCF101, the total training time of the temporal stream is about 20 hours, and that of the spatial stream is about 2.5 hours.

4.2 Experimental results

4.2.1 Embedding experiments of CAU

In this subsection, we explore the appropriate positions and number of CAUs for the spatial and temporal stream networks; the experiments are carried out on HMDB51, and the Inception-v3 architecture is employed as the backbone.

Since the bottom and top features contain different information (the former represent low-level texture information, while the latter represent semantic information), it is worth exploring which features are appropriate for the CAU. To this end, we evaluate the performance of embedding a single CAU following different convolution Inception blocks, namely Block1, Block2, Block3, and Block4. Additionally, we argue that embedding multiple CAUs at multiple positions may achieve satisfactory optimization performance, so it is worth investigating whether multiple CAUs can further increase the network performance.

Tables 1 and 2 show the spatial and temporal stream results of embedding different numbers of CAUs at different positions in the network. The results show that embedding CAUs into Inception-v3 significantly improves the accuracy of both the spatial and temporal streams, and that inserting a single CAU following Block1 achieves the best performance. Note that embedding multiple CAUs does not further improve performance as expected; on the contrary, the accuracy decreases as the number of CAUs increases. From the results in Tables 1 and 2, we deduce that the bottom features fit the CAU better, which can be explained by the fact that low-level features contain rich texture information and are more conducive to generating the attention mask. In contrast, high-level features are more abstract and may not harmonize well with the CAU. Additionally, if the bottom features are modulated repeatedly by CAUs in multiple layers, the already processed attention mask may be destroyed and lose some important information.

Table 1 Spatial stream results of embedding different numbers and different positions of the CAUs into Inception-v3 on HMDB51 split1
Table 2 Temporal stream results of embedding different numbers and different positions of the CAUs into Inception-v3 on HMDB51 split1

4.2.2 Comparison of pre-trained datasets

As the two-stream CNNs take images as inputs, it is feasible to use models trained on Kinetics or ImageNet as initialization. Accordingly, experiments are performed to investigate the impact of the pre-training dataset on recognition performance. BN-Inception is employed, and the parameters are optimized with the back-propagation algorithm. Table 3 gives the comparison results of different pre-training datasets on UCF101.

Table 3 Comparison results of BN-Inception pre-trained on ImageNet or Kinetics on UCF101

As shown in Table 3, the model pre-trained on Kinetics obtains better results than the model pre-trained on ImageNet, which indicates that the size of the pre-training dataset affects recognition performance. Moreover, we argue that the generalization ability and recognition performance of the network may improve as the dataset size increases. Therefore, the model pre-trained on Kinetics is used as the initialization for the subsequent experiments by default, unless stated otherwise.

4.2.3 Evaluation of proposed STCAN

As stated above, embedding a single CAU following Block1 obtains the best performance, so this strategy is used for the subsequent experiments. In addition, we adopt transfer learning and use the model trained on Kinetics as initialization, which can effectively alleviate the problem of insufficient training samples. The performance is evaluated on UCF101 and HMDB51, with the BN-Inception and Inception-v3 architectures employed. The comparison results of STCAN and the baseline (without the CAU embedded) are shown in Table 4, and the detailed accuracies are summarized in Tables 5 and 6.

Table 4 Comparison results of STCAN and Baseline on UCF101 and HMDB51
Table 5 Comparison results of STCAN and Baseline with BN-Inception
Table 6 Comparison results of STCAN and Baseline with Inception-v3

As shown in Table 4, STCAN yields superior performance to the baseline on both UCF101 and HMDB51. Furthermore, by comparing the accuracies in Tables 5 and 6, we can make the following three observations. First, both the spatial and temporal networks embedded with the CAU perform better than those without it, which verifies the feasibility of the proposed method. Interestingly, the improvement of the spatial network is greater than that of the temporal network. We suspect this is because useless information, which is mostly high-frequency, often appears in RGB images but not in optical flow images; it may also be because the less useful information in optical flow images, such as background parts, is often static. Second, when embedded with the CAU, BN-Inception achieves comparable or even better performance than Inception-v3, which may indicate that the CAU brings a more obvious improvement to BN-Inception. Third, the improvement on HMDB51 is more significant than that on UCF101. We suggest two possible reasons: (a) From the perspective of dataset capacity, the actual training dataset is not infinite. The CAU is designed to generate a weight distribution that selectively enhances informative features. When the training data becomes sufficient, the generalization ability of the network improves, and thus the effect of the CAU becomes less significant. (b) From the perspective of video clip content, many clips of different categories in HMDB51 share the same scene, and embedding the CAU may allow the network to focus more on the action features.

4.2.4 Evaluation of proposed module CAU

To evaluate the influence of the proposed module CAU, we perform a comparative experiment in which an SE block [17] is embedded at the same position in BN-Inception or Inception-v3, and the results are shown in Table 7. The baseline uses BN-Inception or Inception-v3 with no module embedded. As shown in Table 7, the proposed module obtains better results, which demonstrates the effectiveness of the CAU.

Table 7 Comparison of different embedding modules

In addition, to evaluate the computational overhead introduced by the CAU, we compare the models in terms of FLOPs (floating point operations) and report the results in Table 8. The results show that embedding the CAU into BN-Inception and Inception-v3 adds only a small amount of computation while improving performance significantly.

Table 8 Comparison on complexity of Inception networks with CAU
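As an aside, the FLOPs of a backbone with and without the CAU can be estimated with a profiling tool; the sketch below assumes the third-party thop package, which is not used in the paper, and reports multiply-accumulate counts, which are commonly quoted as FLOPs.

```python
import torch
from thop import profile  # third-party package; an assumption, not part of the paper's toolchain

def count_flops(model, input_size=(1, 3, 224, 224)):
    """Estimate the complexity of `model` (any backbone, with or without the
    CAU embedded) from a dummy forward pass."""
    dummy = torch.randn(*input_size)
    macs, params = profile(model, inputs=(dummy,))
    return macs, params
```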

Moreover, to provide a clearer comparison of the influence of the CAU on model training and testing, sample training and testing curves for runs of the spatial stream architectures with and without the CAU are depicted in Fig. 7. We observe that the network embedded with the CAU produces a steady improvement during both the training and testing processes, and this result is verified on both the HMDB51 and UCF101 datasets.

Fig. 7 Training and testing curves of the spatial stream architectures with and without CAU. (a) Training results on split1 of the two datasets. (b) Testing results on split1 of the two datasets

4.2.5 Comparison of all categories on the two datasets

Furthermore, we compare the accuracies of all action categories on the two datasets (UCF101 and HMDB51). We evaluate the performance of Inception-v3 with and without the CAU on UCF101 and HMDB51, and the detailed comparison results are shown in Figs. 8 and 9, respectively. It is worth pointing out that the proposed approach performs better in almost all categories; taking UCF101 as an example, the gains are especially clear for HandstandWalking, JumpingJack, JumpRope, etc. However, it performs worse in a few categories, such as ThrowDiscus, which may be due to the similarity between the Shotput and ThrowDiscus categories.

Fig. 8 Comparison results of all categories on UCF101 (split1)

Fig. 9 Comparison results of all categories on HMDB51 (split1)

4.2.6 Comparison with the state-of-the-art

In this subsection, we compare STCAN with several state-of-the-art approaches, including traditional methods [34, 45, 46], two-stream methods [10, 33, 37, 49], and 3D convolution methods [5, 13, 15, 24, 42]. The comparison results are summarized in Table 9. The best accuracies are 88.30% (UCF101) and 61.70% (HMDB51) among the traditional methods, and 98.40% (UCF101) and 81.40% (HMDB51) among the deep learning methods.

Table 9 Comparison results with current state-of-the-art methods on UCF101 and HMDB51

We observe that STCAN outperforms the traditional methods and most 2D convolution methods on both datasets. Moreover, our results outperform the two-stream baseline [37] by 8.18 and 15.77 percentage points, respectively. When compared with 3D CNN methods, STCAN achieves comparable performance. Specifically, I3D (RGB+Flow) and STA-MARS+RGB+Flow perform better than STCAN, probably because I3D can learn seamless spatio-temporal feature extractors through its 3D filters and STA-MARS+RGB+Flow can efficiently learn discriminative feature representations of actions. MARS (Motion-Augmented RGB Stream) [6] is a 3D CNN model trained using a linear combination of a feature-based loss and the standard cross-entropy loss, and its performance is improved by 0.2% and 1.7%, respectively, to 98.40% (UCF101) and 81.40% (HMDB51) with the help of the STA module [24]. For further comparison, the complexity of these methods is shown in Table 10, from which we can see that the proposed method has lower computational complexity. Remarkably, our approach improves recognition performance, with accuracies of up to 96.18% (UCF101) and 75.17% (HMDB51).

Table 10 Comparison on complexity with other methods

5 Conclusion

In this paper, a new method named STCAN has been proposed, which can model long-range temporal structure. Specifically, STCAN includes a simple and effective module named CAU, which can be embedded into the spatial and temporal streams conveniently. The CAU can not only model channel-wise interdependencies to adaptively recalibrate the feature responses, but also prevent the loss of discriminative properties of the original features. Moreover, the proposed model can be trained end-to-end and has lower computational complexity than the methods that achieve comparable results. Since this method has the ability to selectively highlight informative features, future work will extend the ideas of this paper to other recognition domains, such as video surveillance and target tracking.