1 Introduction

In recent decades, the number of fires worldwide has increased, threatening the planet and people's safety and causing hazardous effects and huge damage. Millions of acres are burned, destroying animals, trees, homes, and lives. According to the Center for Disaster Philanthropy, the 43 member countries, including some states in the Middle East and North Africa where forest fires were previously infrequent, are now seeing a significant increase in both the number of fires and the burnt area (Center Philanthropy 2022). In 2021, more than 550,000 hectares were burned in the European Union and its Mediterranean neighboring countries, including Turkey, Algeria and Tunisia. Turkey recorded around 2,793 wildfires burning about 139,503 ha, while in Tunisia 28,493 ha were affected (Statista 2022). As reported in the Technical Report on the Mediterranean wildfires, the countries most affected by wildfires in 2021 were Greece and Italy, followed by Portugal, Spain and France, where hundreds of people were killed and over 620,000 ha were burned in July and August (Eberle and Higuera Roa 2022). Automatic detection of fire events is thereby essential for security applications and intelligent surveillance, particularly given the ineffectiveness of the available sensor-based fire detection systems (Avazov et al. 2022). With these sensors, the alarm is only triggered when smoke or heat is close enough to the detector. In certain situations, such as expansive open areas or high-temperature environments, a sensor-based detection system becomes impractical as it leads to frequent false alarms (Avazov et al. 2022). In addition, these systems cannot provide visual information, which may be essential for helping firefighters quickly understand the fire scene. Important details concerning the fire size, location and behavior are also not provided.

To cope with all these shortcomings, research and development efforts have been carried out to reduce the risk of false alarms and to achieve accurate real-time fire detection at an early stage. This has become particularly significant with the advancement of video surveillance systems. In addition, computer vision techniques and the recent evolution of neural networks are frequently employed in such applications (Çetin et al. 2013; Muhammad et al. 2019). Numerous significant studies related to fire recognition have been proposed over the years, including video-based fire detection (VFD) systems and deep learning-based fire detection systems. VFD approaches were the first works developed as an alternative to the existing sensor-based systems (Chen et al. 2004; Celik 2010). Their detection process involves a manual extraction of flame features, such as color, texture, shape, and motion (Khalil et al. 2021; Wahyono, Harjoko et al. 2022). This extraction step, the feature analysis, and the subsequent detection and classification processes are time-consuming, which limits the use of these VFD methods in real time. To avoid hand-crafted operations and to ensure an automatic fire detection task, deep learning (DL) models have been extensively employed in recent times (Bhat and Khan 2022; Gayathiri et al. 2023; Harsha et al. 2023). They have proven their efficiency in different fields, including object recognition, machine learning (Manohar and Das 2022), parameter estimation (Manohar and Das 2023), and medical analysis (Lakshmi et al. 2024; Manohar and Das 2023). The features they extract are learned automatically from annotated data rather than hardcoded by the developer. Hence, these DL models can be adapted for automatic fire recognition in surveillance applications, as presented in this paper.

It is therefore interesting to develop a real-time fire recognition method that primarily exploits both the spatial and temporal information in video sequences. Most models developed in the literature rely on spatial features, which provide the visual appearance and contextual information of the data. Spatial features are directly extracted from frames through two-dimensional convolutional neural networks (2D CNNs). However, these models lack temporal features, which capture the motion dynamics occurring across a range of video frames. This matters because a fire event can be considered as an action in video sequences, distinguished by its spatial and temporal features across successive frames. The objective of fire recognition is to effectively learn discriminative spatio-temporal representations from video sequences to identify the fire class. Motivated by these observations, the direct learning of spatio-temporal features from video frames is suggested in this work. It is achieved using three-dimensional networks (3D CNNs). The novelty of this contribution is to recognize fire in surveillance videos by designing a three-dimensional network, called the 3D Fire Classification Network “3D FireClassNet”. The presented approach starts by preprocessing the input data to enlarge and diversify it. Then, this preprocessed data is fed through the novel 3D network for training. The designed architecture can be applied directly to consecutive frames to extract and learn spatio-temporal features.

The remainder of this paper is organized as follows. Section 2 provides a background on the existing deep learning architectures used for spatio-temporal analysis. In Sect. 3, a literature review of DL-based approaches designed to recognize fire in video sequences is presented. Section 4 thoroughly describes our proposed method with the novel spatio-temporal convolutional neural network. The experimental findings and discussions are exhibited in Sect. 5, including a comparative study with state-of-the-art works. Lastly, Sect. 6 presents the conclusions of this paper.

2 Deep learning architectures for spatio-temporal analysis

In general, videos comprise spatial and temporal domains, providing much more information than a single image. Spatio-temporal feature extraction methods can be categorized as either hand-crafted or automated (Rasool Abdali and Ghani 2019; Mehta and Singh 2023). Deep learning networks can automatically capture these features, producing promising results in various spatio-temporal approaches for different applications. The following subsections outline the most common deep learning architectures employed for capturing and learning spatio-temporal features.

Fig. 1 Overview of the two-stream architecture for video recognition

2.1 Two-stream architectures

To exploit both spatial and temporal features, several researchers have proposed two-stream architectures. These architectures consist of two separate CNNs (Simonyan and Zisserman 2014), each serving a specific purpose. The first convolutional neural network is the spatial stream, dedicated to handling spatial features. The second is the temporal stream, devoted to handling temporal features. The outputs of these distinct networks can be merged to create a spatio-temporal video representation.

As shown in Fig. 1, the spatial stream takes a single frame from the video and passes it through a series of CNN kernels. Predictions are then generated from the extracted spatial information. The temporal stream operates on the optical flow computed for each frame. Using this motion data, predictions are produced by the final fully connected and SoftMax layers of the temporal network. Each individual stream is constructed using a deep ConvNet. The softmax scores produced by these networks are merged using the late fusion technique, in which the two streams are trained independently and combined just before the model makes its final decision. The final probability is obtained by averaging the predicted probabilities of both streams.

Relying on an external optical flow algorithm is one of the disadvantages of this type of architecture. This algorithm must be executed before the training phase to compute the motion vectors for each video. Coupled with the training of the two networks, this leads to a substantial amount of time required to build the final model.
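To make the late-fusion idea concrete, the following minimal Keras sketch builds two small ConvNets and averages their softmax scores. The backbone depths, filter counts, input resolution and the 10-frame optical-flow stack are illustrative assumptions, not the exact configuration of the original two-stream network.

```python
# Minimal two-stream sketch with late fusion (averaged softmax scores).
# Layer sizes and the 10-frame optical-flow stack are illustrative choices.
from tensorflow.keras import layers, models

NUM_CLASSES = 2  # e.g. fire / non-fire


def small_convnet(input_shape, name):
    """A small 2D ConvNet ending in softmax class scores."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inp, out, name=name)


# Spatial stream: one RGB frame. Temporal stream: a stack of optical-flow
# fields (here 10 frames x 2 displacement channels = 20 input channels).
spatial_stream = small_convnet((224, 224, 3), "spatial_stream")
temporal_stream = small_convnet((224, 224, 20), "temporal_stream")

frame_in = layers.Input(shape=(224, 224, 3))
flow_in = layers.Input(shape=(224, 224, 20))

# Late fusion: in the original scheme each stream is trained independently;
# here the predicted class probabilities are simply averaged for the decision.
fused = layers.Average()([spatial_stream(frame_in), temporal_stream(flow_in)])
two_stream = models.Model([frame_in, flow_in], fused)
```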

Fig. 2 Overview of the CNN-LSTM architecture for video recognition

2.2 Convolutional neural network and long short-term memory (CNN-LSTM)

A second type of deep learning architecture is the association of a Long Short-Term Memory (LSTM) network with a CNN for video analysis. LSTM is a variant of recurrent neural networks (RNNs), introduced by Hochreiter and Schmidhuber (1997). It is designed to retain information over long periods. It is structured as a sequence of recurrent cells linked together, where each cell is connected to the next through a cell state (C). This property makes LSTM well suited to tasks requiring long-term memorization (Karthika et al. 2023), which is why some researchers train LSTM networks to create temporal models.

The combination of an LSTM with a CNN enables the extraction and learning of spatio-temporal features from videos. This architecture is illustrated in Fig. 2. The CNN acts as a spatial feature extractor: the individual images in a video sequence are fed into a CNN model, which extracts spatial features. These features are subsequently passed through the LSTM layer, whose output is connected to a fully connected layer that produces the recognition result. The main goal of incorporating the LSTM is to capture the temporal connections among images by retaining a memory of preceding frames. This enables the model to understand and use the sequential information present in the video data.

This architecture may take advantage of transfer learning by exploiting a pre-trained CNN model, such as VGG-16, VGG-19, ResNet, and others, to extract spatial features. Transfer learning is an effective method for building accurate models, particularly when data is limited. As a result, the combination of CNN and LSTM is robust in learning spatio-temporal features and developing efficient models for video analysis. Nevertheless, it is important to acknowledge that a CNN needs a significantly long training time to fine-tune its vast number of parameters. This issue becomes more complex when the temporal aspect is added to the architecture, because the network must process not just single frames but several video frames simultaneously. Similarly, LSTM training takes a long time, since it has more parameters (Kanna and Santhi 2021), and it also has higher memory requirements.
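A hedged sketch of this combination is given below: a pre-trained VGG-16 (frozen, as one possible transfer-learning choice) is applied to every frame through Keras' TimeDistributed wrapper, and an LSTM models the frame order. The clip length, image size and LSTM width are assumptions for illustration only.

```python
# Minimal CNN-LSTM sketch: a frozen pre-trained CNN extracts per-frame
# spatial features, an LSTM then models their temporal order.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

FRAMES, H, W, C = 16, 224, 224, 3  # illustrative clip shape

cnn = VGG16(include_top=False, weights="imagenet", pooling="avg",
            input_shape=(H, W, C))
cnn.trainable = False  # transfer learning: keep the spatial extractor frozen

clips = layers.Input(shape=(FRAMES, H, W, C))
# TimeDistributed applies the same CNN independently to every frame.
features = layers.TimeDistributed(cnn)(clips)   # (batch, FRAMES, 512)
temporal = layers.LSTM(128)(features)           # temporal relations
outputs = layers.Dense(2, activation="softmax")(temporal)

cnn_lstm = models.Model(clips, outputs)
cnn_lstm.compile(optimizer="adam", loss="categorical_crossentropy",
                 metrics=["accuracy"])
```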

Fig. 3 A ConvLSTM cell (Verlekar and Bernardino 2020)

2.3 Convolutional long short-term memory (ConvLSTM)

Another category of deep learning architectures designed for video recognition is the convolutional long short-term memory (ConvLSTM). It represents a variant of the LSTM architecture that integrates convolutional operations within the structure of the LSTM cell (Shi et al. 2015). These convolutional operations are introduced during transitions between layers. They replace the internal matrix multiplications of the LSTM (Verlekar and Bernardino 2020), as shown with the red color in Fig. 3. Thus, the information passing through ConvLSTM cells retains the input dimension, enabling the network to achieve better spatio-temporal correlations (Verlekar and Bernardino 2020; Kanna and Santhi 2022). The ConvLSTM architecture for video recognition is depicted in Fig. 4.

Fig. 4 ConvLSTM architecture for video recognition

It should be noted that both ConvLSTM and CNN-LSTM architectures serve the same functional purpose: extracting spatio-temporal features from video data for video recognition tasks. However, they differ in structure. Indeed, ConvLSTM incorporates convolution in the architecture, whereas CNN-LSTM externally combines the two types of networks by concatenating their outputs together. As defined, the ConvLSTM network effectively captures localized spatio-temporal correlations. It accomplishes this by employing a convolution operator to predict the future state of a specific cell in the grid using inputs and previous states of its local neighbors (Vrskova et al. 2022).

ConvLSTM proves suitable for processing images and videos with temporal dependencies, achieving significant results. However, its weaknesses lie in its substantial computational demands and high memory consumption.
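The following minimal Keras sketch illustrates the idea: ConvLSTM2D layers replace the LSTM's internal matrix multiplications with convolutions, so the hidden state keeps its spatial layout. Clip length, resolution and filter counts are arbitrary illustrative values.

```python
# Minimal ConvLSTM sketch for clip-level classification.
from tensorflow.keras import layers, models

convlstm = models.Sequential([
    layers.Input(shape=(16, 64, 64, 3)),          # (frames, H, W, channels)
    layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                      return_sequences=True),      # spatial state kept per step
    layers.BatchNormalization(),
    layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                      return_sequences=False),     # keep only the last state
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),
])
convlstm.compile(optimizer="adam", loss="categorical_crossentropy",
                 metrics=["accuracy"])
```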

2.4 Three-dimensional convolutional neural network (3D CNN)

Among the deep learning architectures used for video recognition are the 3D convolutional neural networks (3D CNNs), known as spatio-temporal networks. They play a prominent role in exploiting spatial and temporal features (Tran et al. 2015). The idea is to expand the 2D spatial CNN, based on convolution and pooling layers, into a 3D spatio-temporal CNN, in order to analyze and then recognize videos.

Fig. 5 3D CNN architecture for video recognition

Fig. 6 The difference between 2D and 3D convolution operations: a 2D convolution operation, b 3D convolution operation

3D CNNs are similar to 2D CNNs, but with two primary differences. First, they are designed to capture the temporal relations between video frames by using three-dimensional kernels. This is achieved by processing sequences of frames rather than individual ones. Additionally, a 3D CNN can learn the three-dimensional features of video sequences and generate 3D feature maps, a capability that is impossible with 2D CNNs. An example of a 3D CNN architecture is presented in Fig. 5. As displayed, a 3D CNN can be composed of a succession of Conv3D layers, ReLU activation functions, and 3D pooling layers.

  • 3D convolution layer (Conv3D) In the same way that the convolution layer (CONV or Conv2D) is the basic component of the CNN, the three-dimensional convolution layer (Conv3D) is the fundamental element of the 3D CNN. Indeed, the CONV layer lacks the temporal information in each convolution operation. In contrast, the 3D convolution maintains the temporal information of the input data, producing 3D feature maps as an output volume. The input to Conv3D is convolved across four dimensions: two spatial dimensions (width and height), one channel dimension, and one time dimension (frames). During the convolution process, the 3D CNN generates a three-dimensional activation map, which serves for data analysis and for the incorporation of temporal context. In this operation, three-dimensional filters are applied, and the kernel moves along three directions, as visually depicted in Fig. 6b. The resulting output has the form of a 3D volume. The 3D convolution operation is accomplished by wrapping around the center of a cube and stacking adjacent layers on top of each other (Vrskova et al. 2022). The motion information is captured by the interconnections between the feature maps.

  • 3D Pooling layer The 3D pooling layer has the same purpose as the pooling layer used in the 2D CNN structure. It acts as a nonlinear downsampling operation on an input tensor, reducing its spatial dimensions while retaining only the most significant values. When applying 3D pooling in a 3D neural network, the pooling size must consist of three values, reflecting the 3D data being handled. This operation divides the input tensor into smaller 3D subtensors along all three dimensions. Then, in each subtensor, a single representative element is selected, converting the input tensor into an output tensor in which each subtensor is replaced by its respective element (maximum, minimum or average). Three techniques are commonly employed: Max pooling (selecting the highest value), Min pooling (selecting the lowest value), and Average pooling (computing the average of the values). MaxPooling3D is frequently used in the case of color images, and its visual representation can be seen in Fig. 7.

Fig. 7 MaxPooling3D operation (Vrskova et al. 2022)

This exploitation of temporal context is a notable advantage of 3D CNNs in video analysis, thanks to the four dimensions (two spatial dimensions, one channel dimension and one temporal dimension) that allow temporal interactions between adjacent frames to be easily learned. The 3D CNN architecture is not only uncomplicated, but also fast and easier to train, particularly compared to CNN-LSTM. With sufficient data, the 3D CNN is the most efficient spatio-temporal architecture for video recognition. A limitation of 3D CNNs is that increasing the input dimensions leads to a significant rise in both memory and computational requirements (Tran et al. 2015).
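As a shape-level illustration of the two layers described above, the short Keras sketch below applies one Conv3D and one MaxPooling3D layer to a dummy clip; the clip length, resolution and filter count are arbitrary example values.

```python
# Shape-level sketch of the Conv3D and MaxPooling3D operations.
import tensorflow as tf
from tensorflow.keras import layers

clip = tf.random.normal((1, 16, 64, 64, 3))   # (batch, frames, H, W, channels)

conv3d = layers.Conv3D(8, kernel_size=(3, 3, 3), padding="same",
                       activation="relu")
pool3d = layers.MaxPooling3D(pool_size=(2, 2, 2))

x = conv3d(clip)   # -> (1, 16, 64, 64, 8): one 3D feature map per filter
x = pool3d(x)      # -> (1, 8, 32, 32, 8): time, height and width halved
print(x.shape)
```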

3 Related works

In the literature, the deep learning architectures described in the previous section have been suggested for recognizing fires in videos through spatio-temporal analysis. DL-based fire recognition is considered a significant challenge because of the specific nature of the information contained in videos, particularly the temporal continuity of fire movement. Indeed, it is not only about the basic two-dimensional space of a frame; incorporating previous and subsequent frames is also crucial to effectively capture the temporal information.

A spatio-temporal two-stream convolutional neural network-based fire recognition method is introduced in Shin et al. (2018). The spatial stream employs the VGG-16 network, while the temporal model is built using a 3D CNN. Transfer learning is applied to both streams. The output vectors from each stream are concatenated through a fusion method in the fully connected layer. With this type of architecture, the accuracy of the obtained classification model is enhanced, and false positives are reduced. The proposed model demonstrates superior performance compared to a 2D CNN model, but its computational cost is high. Another approach, presented in Rasool Abdali and Ghani (2019), is based on the CNN-LSTM architecture for real-time fire detection. Initially, a CNN is applied to extract spatial features; LSTM cells then learn the temporal relations to build the model. The time distribution technique is utilized to feed the data. Experimental results show an improvement in fire recognition performance, achieving a promising level of accuracy.

To analyze and recognize the flame regions in both spatial and temporal domains, this CNN-LSTM architecture is also used in Abhilash (2023). Herein, a fire candidate extraction stage is introduced to detect fires of varying sizes. Then, the CNN-LSTM is employed to analyze the small and clipped fire images. Despite the achieved effectiveness, this approach suffers from the instability of the fundamental features (Abhilash 2023).

In the same way, the spatial and temporal features are exploited by Zhikai Yang et al. in Yang et al. (2020). Three novel fire recognition models are developed, based on two lightweight CNNs (ResNet-18 and MobileNet) and the Simple Recurrent Unit (SRU). The SRU, a variant of the Recurrent Neural Network (RNN), improves on the LSTM by adding a reset gate. The first model combines ResNet-18 with the SRU, and the second merges MobileNet with the SRU. In the third, a 3D Conv layer is added between the MobileNet and SRU components, taking into account indoor settings and flame characteristics. The resulting models prove their efficiency compared to other CNN-based fire recognition models that operate on single frames. It is also demonstrated that the third model (MobileNet + 3D Conv layer + SRU) reaches the best performance, confirming the importance of the temporal aspect.

Likewise, another type of DL architecture, using a CNN combined with a ConvLSTM, is presented in Verlekar and Bernardino (2020) to classify video sequences into either fire or non-fire categories. A frame selection step is firstly applied in this approach to process video sequences of variable durations. Indeed, 15 frames are selected from every input scene. Each frame is then fed into a CNN, named Xception, in order to extract static features. Subsequently, spatio-temporal features are extracted using a ConvLSTM. Finally, the output of the last cell of the ConvLSTM is passed to the fully connected network for the classification purpose. According to this method, the achieved classification accuracy validates the effectiveness of the proposed model in classifying video sequences.

The use of ConvLSTM models is also shown in Masrur et al. (2024), where the scale-dependent spatiotemporal interconnections of the fire event are identified and exploited. Two attention-based spatiotemporal models are proposed in that study for predicting wildfire progression. Another related work (He et al. 2024) combines ConvLSTM with a CNN and a Vision Transformer (ViT) for fire recognition. In He et al. (2024), the CNN-based network is used to extract local features, while the Vision Transformer serves as a global feature extraction method, employing multi-head attention to gather information across the entire image. Combining the CNN with ConvLSTM adds the ability to consider image features within a spatiotemporal context. Promising findings demonstrate the effectiveness of the presented approach; however, the model's precision should be further optimized by integrating more dependable datasets.

A spatio-temporal network, designed specifically for night-time wildfire recognition in videos, is provided in Agirman and Tasdemir (2022). The used architecture combines a CNN with a variant of LSTM, known as the bidirectional long short-term memory (BLSTM). In fact, a BLSTM cell employs two interconnected LSTM cell engines. The proposed network initially extracts spatial features from videos, through the use of a pretrained GoogLeNet network, with transfer learning applied to this CNN. These extracted features are fed into a BLSTM network for temporal learning. The final layers of the proposed architecture consist of a fully connected layer with two output classes, followed by a SoftMax layer for probability calculations. Despite the promising detection results, instances of misclassification persist.

Similarly, another DL-based fire recognition approach is proposed in Vu et al. (2021). In this method, flame regions are initially detected by exploiting the motion and color features of the fire. In the subsequent phase, a combination of a CNN and an LSTM is applied to determine whether the flame is a true fire or a non-fire moving object. The CNN architecture is based on ResNet-18 to extract spatial features, while the LSTM model extracts the temporal features in videos. Experimental results demonstrate that the suggested approach performs well in terms of accuracy, with rapid processing speed. Aiming to further improve these results, the authors suggest in Nguyen et al. (2021) an enhanced multistage fire detection method, also based on the CNN-LSTM architecture. As in the previously reported work (Vu et al. 2021), the candidate fire regions are first detected using both a color model and the computed flicker energy. The images from every candidate flame region are then passed through the convolutional network to extract spatial features. For this purpose, the CNN, based on the pretrained ResNet-18 model, is fine-tuned. These features are later input to a multilayer Bidirectional LSTM network, which temporally merges the extracted information and classifies fire/non-fire sequence images. The proposed method achieves the highest F1-score, validating its effectiveness in accurately recognizing fires with minimal false positives. However, the limited availability of fire image data remains a limitation of this method, which ultimately impacts its overall performance.

From the above discussion of different deep learning approaches, it is clear that recognizing fire in video surveillance scenes presents a spatio-temporal challenge. These works exhibit reliable performance compared to deep learning approaches devoted to fire image classification: accurately capturing the spatial and temporal features of the flame object yields good results. It can also be deduced that the exploitation of spatio-temporal features, particularly through deep learning architectures, remains limited in the fire recognition field. Nevertheless, the presented related works frequently suffer from high cost in terms of both time and computation. This is due to the structure of the spatio-temporal architectures, which leads to a high number of parameters. Consequently, the memory requirements increase, owing to the heavy architectures and the large number of video frames needed for spatial and temporal training. Therefore, developing a spatio-temporal architecture with a single-stream network and fewer trainable parameters is still an ongoing research focus, with the aim of mitigating the challenges posed by memory and computational demands.

Fig. 8 Flow chart of the proposed fire recognition approach

4 3D FireClassNet: a novel spatio-temporal convolutional neural network for fire recognition in video surveillance scenes

The suggested approach is a novel DL-based method for fire recognition. It is based on the exploration of the spatial and temporal information available in fire video surveillance scenes. It is designed starting from the 2D convolutional neural network “FireClassNet”, presented in Daoud et al. (2023), which exhibits high efficacy for static image recognition tasks. The effectiveness of this 2D CNN lies in its ability to process spatial information in fire images. However, fire videos inherently contain both spatial and temporal data, which a 2D CNN struggles to capture. Because of this limitation, the “FireClassNet” model overlooks crucial motion information in fire videos, making it unsuitable for the fire video analysis task (Daoud et al. 2023). As a result, it may not correctly predict every video frame, due to rapid changes between successive frames.

Unlike the 2D CNN structure, which focuses only on spatial information, the 3D CNN structure adds a third dimension, time, enabling it to capture spatio-temporal information in videos. Motion and changes across frames are detected by the 3D CNN, making it well suited to the fire video recognition task. The goal of this study is to create an end-to-end 3D CNN designed to classify videos as fire or non-fire by extracting spatial and temporal features. Our major motivation is to remedy the limitation of “FireClassNet”, the 2D CNN architecture presented in Daoud et al. (2023), and to automatically recognize fire videos while improving accuracy and decreasing the number of false alarms. To achieve this, a 3D CNN structure named “3D FireClassNet”, inspired by the “FireClassNet” architecture, is designed. This structure considers the temporal dimension, which enhances the network's ability to recognize complex relationships between spatial and temporal features in videos. Nevertheless, it is important to take into account the increased computational requirements of a 3D CNN architecture. Hence, another objective of this work is to develop the 3D CNN model with fewer parameters, so that it can be used even in mobile systems with limited memory and processing capacity.

The different phases of the proposed approach are displayed in Fig. 8. The process begins by preprocessing the created dataset to enlarge the input data. Subsequently, the fire recognition model is developed using a novel deep learning network that automatically extracts spatio-temporal features. With these extracted features, the model is trained and then tested on new data. The performance of the designed model is evaluated using different evaluation metrics. The details of each phase are described in the following two subsections.

Fig. 9 Examples of the augmented data

4.1 Preprocessing

Before feeding the data to the designed network, preprocessing is conducted. This phase is crucial, as it has a significant impact on overcoming the shortcomings of the created data, such as insufficient data and its lack of diversity. The process involves making essential adjustments and transformations to the data. Since the deep learning models need large amounts of data, it is proposed in our approach to generate additional data by augmenting the collected dataset. Specifically, video data augmentation techniques, including horizontal and vertical flipping and rotation, are employed to expand our dataset and enhance its diversity. Some samples are shown in Fig. 9.

Each video of the constructed dataset undergoes various augmentations, including horizontal flips, vertical flips, and rotations. The augmented dataset, now comprising a diverse set of samples, is then used to train the deep learning model. This preprocessing phase not only enhances the data quantity by introducing different variations of each sample but also improves the model’s performance by mitigating overfitting. By augmenting the dataset in this way, we ensure a richer training set for the model, which should lead to improved results and better generalization.
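A minimal sketch of this augmentation step is given below, assuming every video is already loaded as a NumPy array of shape (frames, height, width, channels). The rotation angle is not specified above, so a 90-degree rotation is used here as one concrete choice.

```python
# Sketch of the video augmentation step: horizontal flip, vertical flip and
# rotation applied identically to every frame of a clip.
import numpy as np


def augment_clip(clip):
    """Return augmented variants of one clip of shape (T, H, W, C)."""
    h_flip = clip[:, :, ::-1, :]                 # horizontal flip (width axis)
    v_flip = clip[:, ::-1, :, :]                 # vertical flip (height axis)
    rotated = np.rot90(clip, k=1, axes=(1, 2))   # rotate every frame by 90 deg
    return [h_flip, v_flip, rotated]


# Example: one synthetic 22-frame RGB clip of size 64 x 64.
clip = np.random.rand(22, 64, 64, 3).astype(np.float32)
print([a.shape for a in augment_clip(clip)])
```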

4.2 Presentation of the “3D FireClassNet” architecture

As observed in the literature review, existing works for fire recognition in videos, particularly those that exploit spatio-temporal features, are limited and still in progress. Most works combine two networks: the first is trained with spatial features and the second with temporal features. This results in high time consumption and large memory requirements. In order to explore spatial and temporal features within a single network and to avoid huge training and prediction times, a novel deep fire recognition model is needed. Motivated by these statements, the main contribution of the presented approach is the development of a 3D CNN architecture for fire video recognition that exploits spatio-temporal features.

Fig. 10 2D and 3D convolutional layer structures

Inspired by the 2D CNN “FireClassNet” suggested in Daoud et al. (2023), our 3D CNN, named “3D FireClassNet”, is designed to simultaneously handle spatial and temporal features. The differences are evident in the convolution and pooling layers, as well as in the employed filter kernel sizes, with the addition of the temporal dimension. Hence, the kernel of a 3D convolution layer is expressed as (H × W × dt), where H and W represent the height and width of the convolution kernel on the 2D plane, and dt denotes the depth of the kernel, representing the time dimension. The 2D and 3D structures of the convolutional layers are depicted in Fig. 10. In this example, H and W are both set to 1 and dt is set to 3. Further details are given in Sect. 2.4.

Fig. 11 Structure of the proposed “3D FireClassNet” network

An overview of the proposed “3D FireClassNet” network is presented in Fig. 11. It consists of four main blocks, without taking into account the input and output data. Three successive blocks of convolutional layers, each accompanied by pooling layers, are intended for feature extraction. The final block consists of fully connected layers designed for video recognition tasks.

The initial block is composed of a sequence comprising a 3D convolution layer, a ReLU non-linear activation function, a batch normalization layer, and a 3D pooling layer. The second and third blocks have the same structure, except that the 3D convolution and non-linear layers are doubled, followed by batch normalization and 3D pooling layers. The successive layers in these three blocks extract spatial and temporal features, thanks to the Conv3D and MaxPooling3D layers. It is important to note that the ReLU non-linear activation function is applied after each Conv3D layer in order to enhance the learning speed. Besides, a normalization operation is added before the 3D pooling layer. This serves the dual purpose of regulating the distribution of inputs to the hidden neurons and enhancing the training speed and overall performance. After the extraction of spatial and temporal features through the first three blocks, the fourth block is dedicated to recognizing fire in videos. Its input is a vector that reshapes the extracted features into a 1D array, which is then processed by the dense layers. This block consists of a fully connected layer with 1024 neurons, followed by ReLU, batch normalization, and dropout layers. Subsequently, there is another fully connected layer with two neurons, and finally a SoftMax layer, as the constructed dataset contains two distinct classes.

In this architecture, 22 frames from each preprocessed video, resized to \(64 \times 64\), serve as inputs to the “3D FireClassNet” network. Thus, the input shape for the 3D convolution layer is (64, 64, 22, 3), where \(64 \times 64\) represents the width and height of the frame, 22 corresponds to the number of selected frames in a video sequence, known as the depth hyperparameter, and 3 is the number of channels. The initial 3D convolution layer applies 16 three-dimensional filters of size \(3 \times 3 \times 3\) to the input data. All subsequent 3D convolutional layers use filter kernels with a size of \(3 \times 3 \times 3\). These layers are introduced as spatio-temporal convolutional layers with a stride of 3. Every filter moves in three directions (x, y, t) to compute the feature representations, as seen in Fig. 6. The output of the 3D convolution is a feature map, represented as a 3D volume (64, 64, 22, 16), which encodes both the spatial content and the temporal context of the input. This layer contains 1312 trainable parameters. The trainable parameters of a 3D CNN structure encompass all the weights (W) and biases (B) in the network; these weights and biases constitute the two types of parameters in each layer. In the case of a 3D Conv layer, the sum of weights and biases gives the number of parameters (\(Param_{conv}\)), computed via the following formula:

$$\begin{aligned} Param_{conv} = W_{conv} + B_{conv} = (K*K*K*C+1)*Nbr_{filters} \end{aligned}$$
(1)

where \(W_{conv}\) represents the number of weights (\(W_{conv}= K*K*K*C*Nbr_{filters}\)), with K the dimension of the 3D filter kernel, C the number of channels in the input, and \(Nbr_{filters}\) the number of filters, which is also the number of biases (\(B_{conv}=Nbr_{filters}\)). Table 1 provides a detailed analysis of how these trainable parameters are derived.
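As a quick check of Eq. (1), the snippet below reproduces the 1312 trainable parameters of the first Conv3D layer (a \(3 \times 3 \times 3\) kernel, 3 input channels, 16 filters).

```python
# Worked check of Eq. (1) for a Conv3D layer.
def conv3d_params(k, c, nbr_filters):
    weights = k * k * k * c * nbr_filters   # W_conv
    biases = nbr_filters                    # B_conv
    return weights + biases

# First Conv3D layer of the "3D FireClassNet": 3x3x3 kernel, 3 channels,
# 16 filters -> 1312 trainable parameters, as reported in Table 1.
assert conv3d_params(3, 3, 16) == 1312
```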

After the initial Conv3D layer, a MaxPooling3D layer, with a size of \(2 \times 2 \times 2\), is applied. It is used to reduce the dimensions of data from an input size equal to (64, 64, 22, 16) to (22, 22, 8, 16). The output dimensions of each 3D max pooling operation follow the same equation as the resulting 2D max pooling output size, provided by Eq. (2).

$$\begin{aligned} Output_{size} = \left( \frac{Input_{size} - Filter_{size} + 2 \times Padding}{Stride}\right) + 1 \end{aligned}$$
(2)

Also, the depth value is reduced from 22 to 8 using the same formula. As indicated in Table 1, the MaxPooling3D layer has no trainable parameters, unlike the batch normalization layer, which has 4 parameters per filter, giving \(Param_{BN}=4*N\). Consequently, the total number of parameters in this first block is 1376, obtained by summing the parameters of the Conv3D and BN layers (1376 = 1312 + 4 * 16).

Table 1 The parameters of the proposed “3D FireClassNet” network

In the second block of the “3D FireClassNet”, two successive Conv3D layers are employed, each with 32 filters and the same filter size of \(3 \times 3 \times 3\) as the first 3D convolution layer. The numbers of trainable parameters in these two Conv3D layers are 13,856 and 27,680, respectively.

All the Conv3D layers use filter kernels with dimensions of \(3 \times 3 \times 3\). The deployment of these two Conv3D layers results in a 3D feature map of size (22, 22, 8, 32). The next layer applies MaxPooling3D with a \(2 \times 2 \times 2\) kernel size and a stride of 2, further reducing the 3D feature maps to a size of (11, 11, 4, 32). This second block contains a total of 41,664 parameters, as deduced from Table 1. The third block follows a similar structure, differing primarily in the number of filters used: 64 filters of size \(3 \times 3 \times 3\) are applied in each Conv3D layer, resulting in feature maps of dimensions (11, 11, 4, 64). After the final 3D convolution layer, there is a MaxPooling3D layer with the same \(2 \times 2 \times 2\) size as the previous MaxPooling3D layer. This results in a 3D feature map of size (6, 6, 2, 64) and 166,272 trainable parameters (the sum of the parameter counts of the two Conv3D layers and the BN layer). Based on the analysis of this table, it is clear that increasing the number of filters also increases the number of parameters.

At the end of the network, a flatten layer is introduced, transforming the feature maps generated by the last MaxPooling3D layer into a single vector. This vector is then passed through a fully connected layer consisting of 1024 units, which contributes a total of 4,719,616 parameters, obtained by applying the following equation:

$$\begin{aligned} Param_{FC\_1} = W_{FC\_1} + B_{FC\_1} =(S_{Width}*S_{Height}*S_{Depth}*N+1)*Nbr_{neurons} \end{aligned}$$
(3)

where \(S_{Width}\), \(S_{Height}\) and \(S_{Depth}\) are the width, height and depth of the output feature map of the layer just before the fully connected layer, N is its number of filters, and \(Nbr_{neurons}\) is the number of neurons in the FC layer. The FC layer incorporates non-linear, batch normalization, and dropout operations. The non-linearity is intended to accelerate the learning process in the fourth block for the recognition task. Batch normalization serves to normalize the previous layer's values in each batch. As a form of regularization, dropout is applied in our “3D FireClassNet” to the dense layer, specifically before the last FC layer, with the standard dropout probability of 0.5 to further reduce overfitting. The final dense layer comprises 2 neurons designed for the output prediction; its size is selected to be equal to the number of target classes (fire and non-fire). This layer ends with a SoftMax classifier, defined by the following Eq. (4):

$$\begin{aligned} P(z_i) = \frac{e^{z_i}}{\sum _{j=1}^{Nbr \,of \,classes} e^{z_j}} \end{aligned}$$
(4)

This ensures a probabilistic representation of the different classes. For a fire video recognition task with 2 classes, the number of parameters in the final fully connected layer is 2050. It is computed using Eq. (5), which differs from formula (3) because there are two types of fully connected layers in a CNN: the first FC type is connected to the last Conv3D or MaxPooling3D layer, whereas the second is connected to other FC layers.

$$\begin{aligned} Param_{FC\_2} = W_{FC\_2} + B_{FC\_2} = ((Nbr_{neurons})_{-1}+1)*Nbr_{neurons} \end{aligned}$$
(5)

Here, \((Nbr_{neurons})_{-1}\) denotes the number of neurons in the previous FC layer. Therefore, the total number of trainable parameters in the presented “3D FireClassNet” model is 4,935,074, all of which are initialized randomly. It is obtained by summing the parameter counts of all 3D convolutional, BN, and FC layers.
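For reference, the Keras sketch below assembles the layer sequence described in this section. The convolution strides and the pooling strides/padding are our assumptions, chosen so that the feature-map sizes ((64, 64, 22) to (22, 22, 8) to (11, 11, 4) to (6, 6, 2)) and the total parameter count of 4,935,074 match the values reported above; Keras lists part of the batch-normalization parameters as non-trainable, but the overall total is the same.

```python
# Sketch of the "3D FireClassNet" layer sequence. Strides and padding are
# assumptions chosen to reproduce the reported feature-map sizes and the
# total parameter count of 4,935,074.
from tensorflow.keras import layers, models


def build_3d_fireclassnet(input_shape=(64, 64, 22, 3)):
    model = models.Sequential(name="fireclassnet_3d_sketch")
    model.add(layers.Input(shape=input_shape))

    # Block 1: Conv3D(16) + ReLU, BN, MaxPooling3D -> (22, 22, 8, 16)
    model.add(layers.Conv3D(16, 3, padding="same", activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.MaxPooling3D(pool_size=2, strides=3, padding="same"))

    # Block 2: two Conv3D(32) + ReLU, BN, MaxPooling3D -> (11, 11, 4, 32)
    model.add(layers.Conv3D(32, 3, padding="same", activation="relu"))
    model.add(layers.Conv3D(32, 3, padding="same", activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.MaxPooling3D(pool_size=2, strides=2, padding="same"))

    # Block 3: two Conv3D(64) + ReLU, BN, MaxPooling3D -> (6, 6, 2, 64)
    model.add(layers.Conv3D(64, 3, padding="same", activation="relu"))
    model.add(layers.Conv3D(64, 3, padding="same", activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.MaxPooling3D(pool_size=2, strides=2, padding="same"))

    # Block 4: Flatten, FC(1024) + ReLU, BN, Dropout(0.5), FC(2) + SoftMax
    model.add(layers.Flatten())
    model.add(layers.Dense(1024, activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(2, activation="softmax"))
    return model


model = build_3d_fireclassnet()
model.summary()   # total parameters: 4,935,074
```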

5 Experimental evaluation and discussion

To assess the effectiveness of the proposed 3D fire recognition model, the experimental protocol is first introduced in Sect. 5.1. It describes the constructed dataset used for training and validating the proposed network, as well as the experimental architecture fine-tuning. Then, Sect. 5.2 focuses on validating the model using different numbers of filters and the adopted experimental protocol; the corresponding experiments are presented, and their results are analyzed and discussed. Furthermore, Sect. 5.3 provides a comparative study between the proposed “3D FireClassNet” and related works, particularly the hand-crafted approaches and the developed spatio-temporal architectures. A comparison between our designed 3D CNN and the 2D network suggested in Daoud et al. (2023) is presented in Sect. 5.4.

It is noteworthy that the training of the derived 3D model is carried out using the deep learning framework Keras on the following hardware: Intel (R) Core (TM) i5 2.5 GHz CPU, 8 GB RAM and Nvidia GeForce GTX 980M 2 GB GPU.

5.1 Experimental protocol

5.1.1 Dataset description

For the different experiments, we created our own dataset for the “3D FireClassNet” training, since there is no standard fire database for this research field. This dataset is constructed by collecting the available video sequences from public databases (Cetin 2007; Foggia et al. 2015; Cazzolato et al. 2017; Grammalidis et al. 2017; Phillips Iii et al. 2002; Steffens et al. 2015), ensuring a more diverse representation of fire events across the samples. The dataset consists of two distinct categories: fire and non-fire video sequences. All details are displayed in Table 2. This established dataset comprises 92 fire scenes and 38 non-fire sequences. The initial step in the training process of the “3D FireClassNet” network involves loading data to be passed through the first network layer. To achieve this, 22 successive frames are randomly taken from each video sequence, since the depth hyperparameter is set to 22. This value is experimentally determined and adopted based on the analysis and discussions presented in Sect. 5.2.2. Selecting successive frames allows us to explore the temporal features present in the video sequences.
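A hedged sketch of this clip-loading step is shown below, using OpenCV to read a video, resize its frames to \(64 \times 64\), and keep 22 consecutive frames starting from a random position (the random-start policy and the channel ordering are assumptions consistent with the description above).

```python
# Sketch of the clip-loading step: 22 successive frames, resized to 64x64.
import random

import cv2
import numpy as np

DEPTH, SIZE = 22, (64, 64)


def load_clip(video_path, depth=DEPTH, size=SIZE):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))
    cap.release()
    if len(frames) < depth:
        raise ValueError(f"{video_path}: fewer than {depth} frames")
    start = random.randint(0, len(frames) - depth)
    clip = np.stack(frames[start:start + depth]).astype(np.float32) / 255.0
    # clip has shape (22, 64, 64, 3); transpose to (64, 64, 22, 3) if the
    # network expects the depth axis third, e.g. np.transpose(clip, (1, 2, 0, 3)).
    return clip
```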

Table 2 Description of the dataset used for the 3D model training

As indicated in the preprocessing phase (Sect. 4.1), our main goal is to enlarge the constructed dataset to address the challenges of limited samples and lack of diversity. Hence, three transformations, namely rotation, horizontal flipping and vertical flipping, are applied to the data. This results in a total of 520 video sequences, comprising 11,440 frames, which form the dataset used as input to the “3D FireClassNet”. By employing video data augmentation techniques, we enhance the dataset's diversity and make it more challenging. These transformations ensure various positions of fire and other objects in the video frames.

5.1.2 Training and validation of the 3D model

As mentioned previously, once the network structure is developed, the next critical phase involves training and validating the model in order to evaluate its performance. For this purpose, the constructed dataset is partitioned into three distinct groups: 60% of the data is allocated for training, 20% for validation, and the remaining 20% is set aside to test the model's effectiveness on a holdout set. This partition strategy is designed to ensure an equitable assessment of the model's capacity to generalize and its competence in handling unseen data. Subsequently, in order to evaluate and compare the performance of the developed model against other spatio-temporal deep learning models designed for fire recognition in videos, various datasets are employed.

Table 3 Training hyperparameters of the proposed “3D FireClassNet” network

5.1.3 Experimental architecture tuning

For the optimization of the presented deep fire recognition model, the hyperparameters of the “3D FireClassNet” have to be carefully selected to obtain an effective model. The adopted hyperparameters are listed in Table 3. Experiments are carried out for different depths of the “3D FireClassNet” network: depth = 8, 10, 20, 22, 25. The depth dimension represents the temporal dimension, i.e., the number of frames taken from each input video to construct the 3D volume of data. Fixing the depth is necessary to accurately capture the temporal information and to ensure that the length of the input videos is the same. Every experiment is therefore repeated with a different depth value to validate our adopted choice of 22 frames. Moreover, the number of epochs is determined using the same method applied for the selection of the temporal dimension value: the number of epochs is varied over 50, 100, 200, and 300, and the value that yields the best results, 200 epochs, is retained. The impact of the depth and the number of epochs on the 3D model's performance is further analyzed in the subsequent subsections. Additionally, the number of training frames in a single set, defined by the batch size hyperparameter, is set to 8 for our network training.

During the training process, the optimizer plays a significant role in reducing the loss and updating the network's parameters. For that, various available optimizers, such as Stochastic Gradient Descent (SGD), Adaptive Gradient (AdaGrad), Adaptive Moment Estimation (Adam), and Root Mean Square Propagation (RMSProp), are tested to select the most adequate one for the “3D FireClassNet” training. Based on the analysis of the results, RMSProp improves the model's performance and is thus retained for our network training. To further optimize our 3D CNN, random initialization of the weights is applied; randomly initializing the parameters consistently yields superior results in our experiments, compared to alternative initialization methods.

Fig. 12 The learning rate tuning method during the training with an initial value of 0.001

Since the learning rate directly determines the network's efficiency by controlling how the weights and parameters are updated, it is considered the most crucial hyperparameter and should be carefully tuned. Consequently, different initial learning rates are tested to investigate their effect on the “3D FireClassNet” network's performance. The initial learning rate is first set to 0.01, then to 0.001 and 0.0001 in subsequent experiments. According to these experiments, the best value for finding a good local minimum is 0.001. Then, every 50 epochs, the learning rate is reduced by a factor of 0.5. Using this learning rate fine-tuning method, illustrated in Fig. 12, results in a lower local loss and increased accuracy.
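A minimal sketch of this schedule, combined with the retained RMSProp optimizer, batch size of 8 and 200 epochs, is given below (the callback-based implementation is our assumption; only the schedule itself, 0.001 halved every 50 epochs, is taken from the text).

```python
# Step-decay schedule: start at 0.001 and halve the learning rate every
# 50 epochs (0.001 -> 0.0005 -> 0.00025 -> 0.000125 over 200 epochs).
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import RMSprop


def step_decay(epoch, lr=None, initial_lr=1e-3, drop=0.5, every=50):
    return initial_lr * (drop ** (epoch // every))


lr_callback = LearningRateScheduler(step_decay, verbose=1)
optimizer = RMSprop(learning_rate=1e-3)

# model.compile(optimizer=optimizer, loss="categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(train_clips, train_labels, epochs=200, batch_size=8,
#           validation_data=(val_clips, val_labels), callbacks=[lr_callback])
```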

To further optimize the designed network, several additional techniques are incorporated into the training process, including data augmentation, batch normalization, and dropout. Data augmentation is used to expand the variability of samples in the dataset. This is achieved by applying three types of transformations to the input videos: rotation, horizontal flipping, and vertical flipping. These transformations generate diverse representations of the frames, featuring various orientations and positions of the fire object. As a result, this approach mitigates the limitations posed by the relatively small number of video frames.

Additionally, batch normalization is also implemented in the “3D FireClassNet” network, specifically after the non-linear layer. Batch normalization is used to standardize neuron activations across layers and replaces bias variables (Nithya et al. 2024). This leads to improved model efficiency and training speed.

Another regularization technique applied to our 3D CNN is dropout, with a probability of 0.5, utilized primarily after the final fully connected layer. Dropout addresses a fundamental issue that can impact the model's predictive ability, namely overfitting: during training, it randomly drops units from the network, enhancing the model's generalization. The fine-tuning of the presented hyperparameters and the judicious selection of parameter configurations contribute to the improved performance of our proposed network.

5.2 Evaluation of the “3D FireClassNet” architecture

The efficacy of the presented “3D FireClassNet” network is assessed by analyzing the training and validation accuracies (Acc) and the loss function (Loss), introduced in Eqs. (6) and (7).

$$\begin{aligned} Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(6)
$$\begin{aligned} Cross\_Entropy \ Loss = -\sum _{i}^{C} Y_{i} * log\hat{Y_{i}} \end{aligned}$$
(7)

The accuracy value indicates the percentage of correctly recognized fire scenes. The loss function, computed via cross-entropy, is the metric minimized by the optimizer during model training. It measures the difference between the model's predictions (represented as \(\hat{Y_i}\) in Eq. 7) and the target labels we aim to predict, known as the ground truth (represented as \(Y_i\) in Eq. 7), for each class i in the set of classes C. In this specific fire classification task, C contains two classes: fire and non-fire. During each phase of training, validation, and testing, the network's accuracy can be measured and monitored to show how quickly it is improving. By minimizing the cross-entropy loss, the model adjusts its parameters to improve its accuracy in distinguishing between fire and non-fire classes.

In addition, other evaluation metrics, including recall (R), precision (P), and F1 score (F1), are computed to evaluate the performance of our 3D CNN. These metrics are obtained using the following equations (a short helper applying them to the confusion-matrix counts is sketched after the definitions below):

$$\begin{aligned} Recall&= \frac{TP}{TP+FN } \end{aligned}$$
(8)
$$\begin{aligned} Precision&= \frac{TP}{TP+FP} \end{aligned}$$
(9)
$$\begin{aligned} F_1 score&= 2 * \frac{Recall*Precision}{Recall+Precision} \end{aligned}$$
(10)

where

  • TP represents the number of fire video sequences that are correctly recognized as fire;

  • FP denotes the number of non-fire videos which are wrongly recognized as fire;

  • TN is the count of non-fire scenes correctly recognized as non-fire;

  • FN indicates the number of fire video sequences that are incorrectly recognized as non-fire scenes.
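A short helper applying Eqs. (6) and (8)-(10) to these four counts is sketched below; the counts in the example call are illustrative, not results from the paper.

```python
# Evaluation metrics computed from the confusion-matrix counts.
def evaluation_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    return accuracy, recall, precision, f1

# Illustrative counts only (not the paper's results).
print(evaluation_metrics(tp=90, tn=36, fp=2, fn=2))
```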

To validate our novel “3D FireClassNet” network structure, two architectures with a variation in the number of filters are proposed, and their results are analyzed and discussed. Furthermore, our model is evaluated based on the retained hyperparameters and the parameters used. The outcomes of these assessments and the conclusions drawn are presented and discussed in the subsequent subsections.

5.2.1 Validation of the filters number

The number of filters in the convolutional layers is defined as follows: 16 filters are applied in the 3D convolution of the first block, doubled to 32 in every 3D convolutional layer of the second block, and increased to 64 in each 3D convolutional layer of the third block. All these filters use a kernel size of \(3 \times 3 \times 3\). To demonstrate the impact of the number of filters on the fire recognition model's performance, a second variant of our architecture is built, in which the number of filters is reduced while keeping the kernel size at \(3 \times 3 \times 3\). A detailed description of the two variants of our architecture is provided in Table 4. In this second variant, the first block starts with 8 filters in the Conv3D layer; this is increased to 16 filters in the convolution layers of the second block and reaches 32 filters in the two Conv3D layers of the third block. The training results of these two architectures are exhibited in Table 5.

Table 4 Details of the “3D FireClassNet” network structure and the second variant of the proposed architecture
Table 5 Training results of the two proposed 3D CNN networks when varying the number of filters
Fig. 13 Training and validation results of the two proposed networks: a and b accuracy and loss curves of the “3D FireClassNet”, c and d accuracy and loss of the second network

Table 5 shows that the second proposed network, which employs a reduced number of filters, outperforms the “3D FireClassNet” network during the training and validation processes. The second network achieves a higher training accuracy (\(Acc_{Train}\)) and a lower loss value (\(Loss_{Train}\)). During the validation phase, its accuracy remains high at 99.04%, whereas the “3D FireClassNet” reaches an accuracy rate (\(Acc_{Val}\)) of 98.1% with a lower loss value of 0.0380. A comparison of these values is also represented in Fig. 13. The training and validation accuracy curves (Fig. 13a, c) show the two models' progress during the training and validation processes. In Fig. 13a, the validation curve of the “3D FireClassNet” is close to the training curve, which converges to 1. In Fig. 13c, however, the validation curve of the second network shows multiple oscillations, unlike the “3D FireClassNet” validation curve. The training and validation loss curves (Fig. 13b, d) decrease to a converging point, with a minor difference in their final values. These results confirm the models' capability to accurately predict unseen data samples. However, the highest accuracy and the lowest loss during the testing phase are obtained by the “3D FireClassNet”, demonstrating the first proposed network's ability to effectively recognize the test video data. In experimental terms, the results are very close, with only a slight difference. Visually, the training and validation curves of the “3D FireClassNet” are closer to each other than those of the second proposed network. Based on these observations and the resulting \(Acc_{Test}\) and \(Loss_{Test}\) values, the structure of the “3D FireClassNet” is adopted, despite its number of parameters and training time being relatively higher than those of the second network.

Table 6 The performance of the two proposed 3D CNN when varying the number of filters

To further justify our choice of the “3D FireClassNet” network structure, the two derived models are evaluated; their performance is depicted in Table 6. The number of correctly recognized videos (TP and TN) is highest with the “3D FireClassNet”. Similarly, for FN, only two fire scenes are wrongly recognized as non-fire videos, compared to four with the second network. This is reflected in the improved recall of the first proposed network, which reaches 99.46% compared to 98.91% for the second one. Additionally, its accuracy of 99.23% is superior to the 99.038% achieved by the second model. These findings prove the ability of the model obtained by training the “3D FireClassNet” to accurately recognize fire video sequences. Nevertheless, the precision of the second variant of our architecture is better, by a margin of 0.27%, because two non-fire videos are incorrectly recognized as fire by the first model. Thus, the number of false positives should be further reduced to enhance the model's performance. Based on these outcomes, the model derived from the “3D FireClassNet” network performs better than the second model obtained by training the proposed variant of our architecture, which further justifies our selection of the presented “3D FireClassNet”.

5.2.2 Experimental hyperparameters tuning for an enhanced “3D FireClassNet” architecture

In order to validate the retained hyperparameters, several distinct models are trained to confirm the effectiveness of our novel model in recognizing fires in video sequences. To achieve this, and to confirm our selection of the experimental protocol for optimizing the suggested 3D CNN network, one hyperparameter is varied at a time, while keeping the others constant.

Table 7 Accuracy and loss results of the proposed network “3D FireClassNet” by varying the initial value of the learning rate
Table 8 Effect of the learning rate tuning method
Table 9 Accuracy and loss results of the proposed network “3D FireClassNet” when varying the depth hyperparameter
Fig. 14 Test performance of the proposed model when varying the depth value

Fig. 15 The resulting confusion matrices when varying the depth hyperparameter value: a depth = 8, b depth = 10, c depth = 20, d depth = 22, e depth = 25

  • Validation of the learning rate tuning method During the development of the 3D fire recognition model, the parameters are updated according to the choice of hyperparameters, particularly the learning rate, which significantly influences the performance of the network. It is therefore proposed to evaluate the impact of different initial learning rates on the network’s performance. Initially, a learning rate of 0.01 is tested; then, 0.001 and 0.0001 are used in the subsequent trials. The experimental findings are detailed in Table 7. For an initial learning rate of 0.01, the model achieves an accuracy of 97.11% on the testing set, which is 0.967% lower than the accuracy reached with an initial learning rate of 0.001. When the initial learning rate is set to 0.0001, the testing accuracy (\(Acc_{Test}\)) drops to 91.35%, a reduction of 6.727% compared to the accuracy at 0.001. The same trend is observed in the loss values. The model with an initial learning rate of 0.001 thus performs best, achieving the highest testing accuracy of 98.077% and the lowest loss value of 0.0696, so the chosen initial learning rate for our novel 3D model is 0.001. These results show the effect of the initial learning rate on the “3D FireClassNet” training and demonstrate that the designed model is effective at recognizing fire and non-fire sequences on unseen data (testing set). To avoid overfitting or underfitting, the learning rate also has to be adjusted and monitored during training, so it is reduced by a factor of 0.5 every 50 epochs, as indicated in Table 3 and presented in Fig. 12 (a minimal sketch of this schedule is given after this list). The impact of this tuning method is presented in Table 8. The learning rate is dropped from an initial value of 0.001 to a final value of 0.000125. With each reduction by 0.5, both training and validation accuracies improve: \(Acc_{Train}\) increases from 96.79% to 99.36%, and the training loss decreases significantly from 0.0874 to 0.0088. Similar improvements are observed during validation. Throughout training and validation, decreasing the learning rate enhances the model stability and allows for finer adjustments to the weights, enabling the 3D model to converge to a more optimal point in terms of loss.

  • Validation of the depth hyperparameter Different experiments are conducted to evaluate the impact of the depth hyperparameter on the model’s performance. As indicated in Sect. 5.1.3, the depth hyperparameter is the temporal dimension of the “3D FireClassNet”: it represents the number of frames extracted from each input video to create the three-dimensional data volume (see the frame-sampling sketch after this list). The depth is first set to 8, and then to 10, 20, 22 and 25 in successive trials. The results of these variations are exhibited in Table 9. When the depth is equal to 8 or 25, the accuracy of the “3D FireClassNet” over the training set reaches 100%, while the testing accuracies only reach 96.15% and 88.5%, respectively; this combination of high training accuracy and low testing accuracy indicates that the model is overfitting. The highest testing accuracy (\(Acc_{Test}\)) of 98.077% is obtained when the depth hyperparameter is set to 22, which also leads to closer values of \(Acc_{Train}\) and \(Acc_{Val}\). In addition, the lowest loss values are achieved when the depth is equal to 22. Therefore, a depth value of 22 is adopted. These outcomes are confirmed by Fig. 14, which illustrates the effect of varying the depth value on the model’s performance. The highest values, achieved when the depth is set to 22, prove the model’s robustness compared with the other obtained models. The presented 3D model achieves 99.46% in terms of both recall and precision. A high recall indicates that the majority of fire video sequences are correctly recognized, resulting in a high number of true positives and a low number of false negatives, while the precision of 99.46% reflects a low number of false positives (only two non-fire videos are incorrectly classified as fire scenes). Figure 15, displaying the corresponding confusion matrices, further supports these results. In conclusion, our model, with a temporal dimension in the input vector, demonstrates its ability to accurately identify most fire and non-fire videos with a minimal number of false positives.

  • Validation of the used optimizer Further experiments are carried out to validate the choice of the adopted hyperparameters, specifically the optimizer and the use of the data augmentation technique. As previously mentioned, one hyperparameter is varied while the others are fixed. As depicted in Table 10, four optimizers are tested: Adam, SGD, AdaGrad, and RMSProp. Then, two trials are performed, one without data augmentation and the other with data augmentation, to demonstrate the added value of this technique in improving the model’s efficiency. Analyzing the results shows that the models obtained with the Adam and SGD optimizers achieve higher training accuracies than testing accuracies. With the Adam optimizer, \(Acc_{Train}\) is 100%, which is 2.89% greater than the testing accuracy (\(Acc_{Test}\)), pointing to a possible overfitting issue. Similarly, with the SGD and AdaGrad optimizers, the testing accuracies (\(Acc_{Test}\)) are 86.53% and 84.61% respectively, which are 13.47% and 15.39% lower than the training accuracies (\(Acc_{Train}\)). This gap between the training and testing accuracies is not observed with the RMSProp optimizer, which yields the best results overall. These findings are confirmed by Fig. 16, which shows the impact of varying the optimizer on the model’s performance: RMSProp outperforms the other optimizers in terms of accuracy and F1 score. The main issue with AdaGrad is its slow convergence, because the sum of squared gradients only accumulates and never decreases. RMSProp addresses this problem by introducing a decay factor, which allows the accumulated sum to gradually shrink over time and prevents the learning rate from becoming too small (see the update-rule sketch after this list). Similarly, the limitation of the SGD optimizer is its slower convergence. For Adam, the results are close to those of RMSProp without exceeding them. The superior performance of RMSProp reflects its ability to stabilize the learning process by dividing the learning rate by the moving average of the squared gradients, which results in a more consistent training process. With the RMSProp optimizer, our derived 3D model achieves the highest precision and recall, and hence high accuracy and F1 score. This means that our model correctly identifies most fire video sequences, yielding a high number of true positives and a low number of false negatives. Therefore, we adopt the RMSProp optimizer for the optimization of our “3D FireClassNet”.

  • Validation of the used weights initialization method The initialization of layer weights is an important aspect of deep learning which can improve or degrade the model’s performance (Chandra and Das 2023a, b, 2024). In our experiments, different techniques, including the Glorot (Xavier) (Glorot and Bengio 2010) and He (He et al. 2015) distributions and random initialization, are tested to show their impact on the training process (a small numerical illustration of these formulas is given after this list). The Glorot initializer depends on the numbers of input and output neurons and comes in two variants (Normal and Uniform). For Glorot Normal initialization, the initial values are drawn from a Gaussian distribution centered on 0, where each layer has its own standard deviation defined by \(std = \sqrt{2 / (neuron_{in} + neuron_{out})}\). For Glorot Uniform initialization, the initial values are drawn from a uniform distribution within [-limit, limit], where \(limit = \sqrt{6 / (neuron_{in} + neuron_{out})}\). The He initializer was proposed by He et al. because the Glorot initializer is not effective with the ReLU activation function. He Normal initialization draws the initial values from a Gaussian distribution centered on 0 that depends only on the input units, with \(std = \sqrt{2 / neuron_{in}}\). For He Uniform initialization, weights are drawn from a uniform distribution within [-limit, limit], where \(limit = \sqrt{6 / neuron_{in}}\). Lastly, in random initialization, weights are initialized using random values drawn from Gaussian distributions. The resulting accuracy and loss values for these initializers are exhibited in Table 11. The Glorot Normal initialization achieves the highest accuracy values in each phase: during training it reaches an accuracy (\(Acc_{Train}\)) of 99.35%, and for validation the accuracy (\(Acc_{Val}\)) reaches 99.04%, with a similar testing accuracy (\(Acc_{Test}\)) of 99.04%. These close accuracy values prove the model’s ability to make accurate predictions on unseen samples and highlight the effectiveness of the Glorot Normal initialization method. For the Glorot Uniform initializer, however, the loss values (\(Loss_{Train}\) and \(Loss_{Test}\)) remain higher than those obtained with random initialization. Concerning the He distributions, there is a gap between the training and testing accuracies, leading to inaccurate recognition of fire scenes. For the model initialized with random weights, the accuracy and loss values are approximately similar, showing good performance in recognizing the test data. Since the training accuracy of the random weights initialization converges rapidly to a constant value, this final initializer is retained. In addition, its accuracy and loss converge faster during the validation phase, indicating more stable training and validation processes compared to the models initialized with the Glorot and He distributions.

  • Validation of the use of data augmentation technique Two experiments are performed, one without data augmentation and the other with data augmentation, to demonstrate the effect of the video data augmentation technique on the model’s performance (an illustrative augmentation sketch is given after this list). In this experiment, the “3D FireClassNet” is trained without data augmentation, using RMSProp as the optimizer, the adopted learning rate schedule, and a depth value of 22. The results in Table 12 indicate that, when the model is trained without data augmentation, it achieves a final test accuracy of 65.38%. When the data augmentation technique is incorporated during training, the model’s test accuracy increases significantly to 98.077%. This represents an improvement of 32.697%, highlighting the necessity and benefits of augmented data in the model prediction process. The loss value (\(Loss_{Test}\)) is also greatly improved. Moreover, data augmentation improves the training and validation processes, as shown in Fig. 17. Figure 17c, d illustrate that the model trained without data augmentation strongly overfits the training data, which explains the gap between the curves: the model performs better on the training data than on the validation data, and this divergence is a clear indicator of an overfitting model. In contrast, in Fig. 17a, b, where the “3D FireClassNet” is trained with data augmentation, the training and validation curves of the accuracy and loss are much closer. Data augmentation, which effectively increases the dataset size by generating different realistic scenes, acts as a regularization technique: it helps prevent overfitting and enhances the model’s ability to generalize and successfully predict the testing set. Consequently, our model’s performance on unseen data is improved, which explains the improvement in the testing results (\(Acc_{Test}\) and \(Loss_{Test}\)). These two experiments demonstrate the added value of the data augmentation technique in improving the model’s efficiency.

  • Validation of the number of epochs Once the hyperparameters mentioned above are selected and evaluated, the number of epochs is varied to determine the most suitable value for training our “3D FireClassNet”. It is initially set to 50, and then varied to 100, 200, and 300 epochs. The results of this experiment are presented in Table 13. The highest accuracy value, closest to the validation accuracy and with the minimal losses (\(Loss_{Train}\)) and (\(Loss_{Val}\)), is achieved when the number of epochs is equal to 200. Increasing the number of training iterations further may lead to overfitting, as observed with 300 epochs, where a difference of 8.65% appears between the training and testing accuracies. Compared with the remaining results, using 200 epochs is optimal for the 3D CNN training. Regarding the test process, the model trained with 200 epochs performs better on unseen data, reaching an accuracy of over 98%. Based on this analysis, we conclude that training our “3D FireClassNet” for 200 epochs is the most suitable choice.
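
As a complement to the learning-rate tuning method validated above (initial value 0.001, halved every 50 epochs down to 0.000125 over 200 epochs), a minimal Keras-style scheduler could look as follows. This is a sketch under the assumption that Keras callbacks are used; it is not the original training script.

    import tensorflow as tf

    def step_decay(epoch: int, lr: float) -> float:
        # Halve the rate every 50 epochs: 0.001 -> 0.0005 -> 0.00025 -> 0.000125
        initial_lr = 1e-3
        return initial_lr * (0.5 ** (epoch // 50))

    lr_schedule = tf.keras.callbacks.LearningRateScheduler(step_decay, verbose=1)
    # model.fit(train_videos, epochs=200, validation_data=val_videos, callbacks=[lr_schedule])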
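
The frame-sampling step behind the depth hyperparameter can be sketched as follows. The helper below is hypothetical (OpenCV decoding and a 112 x 112 frame size are assumptions; only depth = 22 comes from the experiments above) and simply builds the (depth, height, width, channels) volume fed to the 3D CNN.

    import cv2
    import numpy as np

    def sample_volume(video_path: str, depth: int = 22, size=(112, 112)) -> np.ndarray:
        """Uniformly sample `depth` frames from a video and stack them into one input volume."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        indices = np.linspace(0, max(total - 1, 0), depth).astype(int)
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.resize(frame, size).astype(np.float32) / 255.0)
        cap.release()
        if not frames:
            raise ValueError(f"Could not decode any frame from {video_path}")
        return np.stack(frames)  # shape: (depth, H, W, 3)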
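
The behaviour attributed to RMSProp in the optimizer comparison, dividing the learning rate by a moving average of the squared gradients, corresponds to the textbook update rule below. The NumPy sketch is illustrative only and does not reproduce the exact optimizer settings of the experiments.

    import numpy as np

    def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
        """One RMSProp update: the squared-gradient average decays instead of only accumulating."""
        cache = decay * cache + (1.0 - decay) * grad ** 2
        w = w - lr * grad / (np.sqrt(cache) + eps)
        return w, cache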
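
The Glorot and He scales used in the weight-initialization comparison can be computed directly from the fan-in and fan-out counts. The snippet below is a small numerical illustration of those formulas and of how such initializers would be selected for a Conv3D layer in Keras; it is illustrative, not the original code.

    import numpy as np
    import tensorflow as tf

    def glorot_normal_std(fan_in: int, fan_out: int) -> float:
        return np.sqrt(2.0 / (fan_in + fan_out))

    def glorot_uniform_limit(fan_in: int, fan_out: int) -> float:
        return np.sqrt(6.0 / (fan_in + fan_out))

    def he_normal_std(fan_in: int) -> float:
        return np.sqrt(2.0 / fan_in)

    # Selecting an initializer for a (hypothetical) Conv3D layer:
    layer = tf.keras.layers.Conv3D(
        32, kernel_size=(3, 3, 3),
        kernel_initializer="glorot_normal",  # alternatives: "he_normal", "random_normal"
    )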
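
The video data augmentation evaluated above can be sketched as frame-consistent transformations applied to each clip. The transformations shown here (horizontal flip and brightness jitter) are illustrative assumptions and not necessarily the exact set used to build our dataset.

    import numpy as np

    def augment_volume(volume: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        """Apply the same random transformation to every frame of a (depth, H, W, C) volume."""
        out = volume.copy()
        if rng.random() < 0.5:              # horizontal flip of every frame
            out = out[:, :, ::-1, :]
        brightness = rng.uniform(0.8, 1.2)  # mild, clip-wide brightness jitter
        return np.clip(out * brightness, 0.0, 1.0)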

Table 10 Accuracy and loss results of the proposed “3D FireClassNet” when varying the optimizer
Fig. 16 Test performance of the proposed model when varying the optimizer

Table 11 Accuracy and loss results of the proposed network “3D FireClassNet” when varying the weights initialization method
Table 12 Training results when adding video data augmentation over 200 epochs with RMSProp optimizer
Fig. 17 Effect of adding the video data augmentation on the training and validation results of the proposed “3D FireClassNet”: a, b accuracy and loss curves when training with video data augmentation, c, d accuracy and loss curves when training without video data augmentation

Table 13 Accuracy and loss results of the proposed network “3D FireClassNet” when varying the number of epochs

5.3 Comparative study

In order to further demonstrate the effectiveness of the proposed 3D fire recognition model, a comparative analysis is conducted with state-of-the-art models. Two types of fire recognition approaches that exploit spatio-temporal features are used for this comparison: hand-crafted approaches (Dimitropoulos et al. 2012; Barmpoutis et al. 2013; Dimitropoulos et al. 2015; Torabian et al. 2021) and deep learning ones (Shin et al. 2018; Kim and Lee 2019; Vu et al. 2021; Nguyen et al. 2021). The results of these two comparisons are summarized in Tables 14 and 15 in the subsequent subsections. It is important to note that in these tables, the values presented in bold indicate the best results, while the entries marked “NA” signify that the corresponding values are not reported in the related research.

Table 14 Comparison of the presented 3D model performance with the existing hand-crafted methods

5.3.1 Comparison with the existing video-based fire detection approaches

Table 14 presents the comparison between the developed 3D fire recognition model and the related works based on manual spatio-temporal feature extraction. Each contribution is evaluated on a specific dataset, because there is no standard fire database, as previously mentioned. For an accurate comparison, our model is assessed on the datasets used by these approaches. In Dimitropoulos et al. (2012); Barmpoutis et al. (2013); Dimitropoulos et al. (2015), the “Firesense” dataset, containing 11 fire videos and 16 non-fire scenes (Grammalidis et al. 2017), is used for the experiments. Dimitropoulos et al. focus on extracting spatial and temporal features using different descriptors. In Dimitropoulos et al. (2012), a motion descriptor based on spatio-temporal features is employed, combined with a color model. These extracted features are merged with a new spatio-temporal consistency measure in Barmpoutis et al. (2013). Afterwards, dynamic texture analysis is introduced in Dimitropoulos et al. (2015) to enhance the fire recognition performance. As shown, the False Positive Rate (FPR) is improved, across Dimitropoulos et al.’s works, from 13.8% in Dimitropoulos et al. (2012) to 0% in Dimitropoulos et al. (2015). Similarly, no false positives are produced by our model when it is evaluated on the “Firesense” dataset. Besides, its accuracy of 99.37% is greater than the 98.05% achieved by the model presented by Barmpoutis et al. (2013). This demonstrates that, given sufficient data and computational power, our proposed 3D CNN architecture outperforms the traditional hand-crafted methods. These approaches, relying on manually extracting and analyzing features, might capture certain aspects of the data well but often struggle with the high-dimensional nature of video data. In contrast, our proposed three-dimensional network automatically extracts and directly learns features from the data itself, enabling it to capture the spatio-temporal characteristics of video sequences. Despite the promising findings achieved by these approaches, they may not reach the performance of a 3D model, primarily because hand-crafted methods can miss important temporal information that is crucial for accurately recognizing and tracking fire in videos. The 3D CNN, on the other hand, processes the spatial and temporal dimensions simultaneously, offering a more comprehensive analysis of the video data.

Another dataset, known as the “Bilkent” dataset (Cetin 2007), is employed by Torabian et al. (2021) to assess their fire recognition model. Their approach is based on extracting spatial and temporal features from each frame, which are then fed to an SVM classifier to distinguish fires in video sequences. The difference in the False Positive Rate (FPR) between our model, evaluated on the “Bilkent” dataset, and that of Torabian et al. does not exceed 0.0108%, indicating comparable effectiveness at reducing false positives. The model introduced in Torabian et al. (2021) also provides slightly better fire recognition performance than our 3D model in terms of recall. Our recall of 96.42% can be explained by some misclassifications, where several fire scenes are incorrectly recognized as non-fire videos. It is important to note that the “Bilkent” dataset comprises only 13 fire videos, which is insufficient for training our “3D FireClassNet”, which requires both classes: fire and non-fire.

The analysis of this table reveals that our 3D model outperforms the related hand-crafted methods in terms of False Positive Rate, particularly when testing on the datasets used in the related works. When the test performance is measured on our created dataset, even better results are achieved than when testing on the other datasets. This reflects the impact of the preprocessing phase, where the input data are augmented through different transformations. Our dataset includes diverse positions and orientations of the fire object in video sequences and contains challenging samples of fire and non-fire scenes. Therefore, our generated 3D model shows a greater ability to recognize fire and non-fire video sequences, yielding significant results. These outcomes highlight the advantages of deep learning models in recognition tasks compared to hand-crafted methods. The latter typically require more computational time and memory for the feature extraction and recognition processes, since the whole recognition pipeline is applied to each frame. In contrast, once a three-dimensional deep neural network is trained, the prediction process can be quickly accomplished, mainly because the 3D CNN simultaneously processes both spatial and temporal dimensions, providing a more accurate analysis of the video data.

5.3.2 Comparison with the related deep learning models based on spatio-temporal networks for fire recognition

Concerning the existing deep fire recognition models based on spatio-temporal architectures, the proposed 3D model is compared with four related models (Shin et al. 2018; Kim and Lee 2019; Vu et al. 2021; Nguyen et al. 2021). For a fair comparison, our model is evaluated using the datasets from these works. Table 15 presents the comparison between our approach and the one proposed in Shin et al. (2018), using the “Furg” dataset (Steffens et al. 2015), which is composed of 17 fire videos and 6 non-fire scenes. As shown, our presented 3D model outperforms the Shin et al. (2018) model in recognizing fire sequences in terms of FPR, precision and accuracy: the obtained FPR is lower than that of Shin et al. (2018), and the resulting precision and accuracy values are greater than those achieved by their model. Their model is composed of two streams: the spatial stream is based on the VGG-16 network, and the temporal one is designed using a 3D CNN to capture the temporal information. Hence, it is demonstrated that the proposed “3D FireClassNet”, based on a single 3D convolutional neural network, provides better fire recognition performance than the approach of Shin et al. (2018).

Table 15 Comparison of our 3D fire recognition model’s performance with the related deep models

It can also be seen from this table that the “Mivia” dataset is utilized in Kim and Lee (2019); Vu et al. (2021); Nguyen et al. (2021). On this dataset, containing 14 fire videos and 16 non-fire scenes, our presented model is more efficient than the three state-of-the-art models of Kim and Lee (2019); Vu et al. (2021); Nguyen et al. (2021), whose networks are based on a CNN-LSTM architecture for the spatio-temporal analysis. In fact, our precision is superior by 3.83% to that of the model of Nguyen et al. (2021). The False Positive Rate (FPR) is also improved with our “3D FireClassNet” by 2.6706%, which demonstrates the ability of our model to reduce the number of false positives and is reflected in the better precision value of 95.83%, compared to their model’s precision of 92%. It should be noted that our fire recognition model, evaluated on our constructed dataset, reaches the highest scores.

This advantage derives not only from the diversity of samples in our dataset, but also from the simplicity of the proposed 3D CNN structure. A 3D CNN is designed to simultaneously analyze spatial and temporal information within a single network. It differs greatly from the CNN-LSTM and two-stream networks, which require separate models to process spatial and temporal data, leading to longer training times: these more complex models need much more time to learn, as they involve the coordination of two separate networks. With the presented “3D FireClassNet”, this process is optimized by reducing the overall training time while efficiently capturing the crucial temporal dynamics and spatial features required for accurate fire recognition. This efficiency makes the 3D CNN particularly well suited to our diverse dataset, enabling robust performance without the extended training time needed by other methods.

Table 16 Performance comparison with other related works

To further assess our model’s effectiveness, it is compared to other approaches based on spatio-temporal networks; Table 16 presents the comparative results. The precision and recall achieved by our model exceed those obtained by Yang et al., who used a lightweight CNN combined with a Simple Recurrent Unit (SRU) (Yang et al. 2020). Specifically, our model’s precision is 99.46% compared to 97.4%. The lower precision in Yang et al. (2020) suggests that their model may require additional training or specific heuristics to better separate fire from non-fire instances. The improved scores indicate that our 3D network structure yields more accurate fire recognition outcomes, thanks to the simplicity of the “3D FireClassNet” and its well-adjusted hyperparameters. More notably, when compared with studies that combine a CNN architecture with BLSTM and LSTM, respectively (Agirman and Tasdemir 2022; Abhilash 2023), our model demonstrates high overall performance with an accuracy of 99.23%. This shows that the simplified structure of the proposed “3D FireClassNet”, along with the optimized training hyperparameters, outperforms the CNN-BLSTM and CNN-LSTM architectures. These architectures, which require analyzing multiple consecutive video frames simultaneously, typically need longer training times and a larger number of parameters. Furthermore, our model exhibits superior performance with a weighted harmonic mean (F1 score) of 99.46%, which is better than the results of works employing ConvLSTM-based architectures with various modifications, as in He et al. (2024) and Masrur et al. (2024).

According to this comparison, our 3D CNN is the best performing architecture for fire recognition purposes. This is due not only to the varied and diversified samples that constitute our dataset, but also to the simple structure of a 3D CNN: a single network is designed to analyze spatial and temporal information simultaneously. It does not require a significant amount of training time, unlike the CNN-LSTM or two-stream architectures, which need considerable time to train two separate networks. Thanks to its uncomplicated structure, the 3D CNN only requires a four-dimensional input per sample: two spatial dimensions, one channel dimension and one temporal dimension. Therefore, the structure of the proposed “3D FireClassNet”, together with the hyperparameters adopted in the presented experimental protocol and a sufficient amount of data, contributes to the promising results achieved.

5.4 Comparison between “3D FireClassNet” and “FireClassNet” architectures

As mentioned, our proposed network “3D FireClassNet” is inspired by the “FireClassNet” presented in Daoud et al. (2023), which was developed for fire image classification. Our idea is to expand the 2D spatial CNN into a spatio-temporal neural network using a 3D CNN, in which spatio-temporal features are learned and a time dimension is added to the input vector. This temporal dimension enhances the capability of the network to analyze video sequences and recognize the fire object. The main difference lies in the convolutional and max pooling layers: Conv2D layers with a filter size of \(2 \times 2\) and MaxPooling2D layers are applied in the 2D structure of “FireClassNet”, whereas Conv3D layers with a kernel size of \(3 \times 3 \times 3\) and MaxPooling3D layers are used in the 3D structure of the “3D FireClassNet” network, as sketched below. The benefit of employing Conv3D layers is clearly demonstrated, since they exploit spatio-temporal features rather than relying only on the spatial features used for fire image classification.
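
To make this structural difference concrete, one convolutional block of the 3D variant could be written as follows in Keras. The filter count, input resolution and classification head are placeholders; only the \(3 \times 3 \times 3\) Conv3D kernel, the MaxPooling3D layer and the depth of 22 frames reflect the description above.

    import tensorflow as tf

    # Illustrative 3D block (placeholder sizes; input: 22 frames of 112 x 112 RGB pixels).
    inputs = tf.keras.Input(shape=(22, 112, 112, 3))  # (depth, H, W, channels)
    x = tf.keras.layers.Conv3D(32, kernel_size=(3, 3, 3),
                               activation="relu", padding="same")(inputs)
    x = tf.keras.layers.MaxPooling3D(pool_size=(2, 2, 2))(x)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # fire / non-fire
    model = tf.keras.Model(inputs, outputs)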

Table 17 Comparison between “FireClassNet” and “3D FireClassNet” networks

It is experimentally shown, from Table 17, that our model based on the 3D CNN performs better than the model developed for fire image classification. During the experiments, it is observed that the training time of the 3D CNN network is reduced compared to that of “FireClassNet”, as is the prediction time: predicting a whole video is faster than predicting each frame of a sequence separately, so the large amount of time required for frame-by-frame prediction is avoided. This is because the model’s input is a video in which the spatial and temporal information is treated simultaneously through a single-stream network. In addition, concerning the number of parameters of each architecture, the “3D FireClassNet” has approximately 4 M parameters, compared to 6 M for the “FireClassNet” network. With these achievements, the main goals of this study have been accomplished, including the development of a simplified 3D CNN structure for fire video analysis. The derived model is characterized by fewer parameters and reduced training and prediction times, making it suitable for deployment on small devices to recognize fires in video surveillance systems.

6 Conclusion

In this paper, a novel deep fire recognition approach for video surveillance scenes is introduced. It is principally based on a new 3D convolutional neural network, the “3D FireClassNet”, which captures and learns spatial and temporal features using a single-stream network. The approach consists of preprocessing the input video data by applying various transformations; subsequently, the proposed 3D CNN is trained on the varied video samples of the created dataset, and an accurate 3D model capable of recognizing fire in videos is generated. The experimental results demonstrate the efficiency of the 3D CNN structure in learning the spatial information and the temporal interactions between adjacent frames. The comparative analysis also shows that the “3D FireClassNet” outperforms the related hand-crafted methods and spatio-temporal architectures in terms of false positive rate and precision. This proves the potential benefits of using a 3D CNN in the fire recognition field, specifically to exploit the spatial and temporal features simultaneously within a single network. Hence, motion is captured more effectively than with a 2D CNN, yielding improved performance and reduced training and prediction times.

Despite these promising results, the developed 3D model has some limitations that should be addressed in our follow-up work. Training the network with a larger number of videos covering diverse situations and contexts can result in greater accuracy. To further enhance the model’s performance, we propose enriching the dataset with more fire and non-fire scenes, incorporating more challenging sequences. It is also planned to expand the fire recognition task to include smoke detection, so that fires can be detected from their early stages, starting with the appearance of smoke. These improvements aim to create a more robust and efficient 3D CNN model for a comprehensive fire recognition system applicable to real-world scenarios.