1 Introduction

Every year, tens of thousands of people worldwide die of drowning, through accidents involving crew members and tourists, the capsizing and sinking of ships, and similar causes. The main difficulty is that currents are turbulent and the water area is large, so a person who has fallen into the water is hard to find and locate. With the upgrading of computing hardware and the improvement of artificial intelligence algorithms, image processing and detection have been applied to many kinds of problems, but the detection of drowning persons still urgently needs to be solved.

Image fusion is an enhancement technology. Its purpose is to combine images obtained by different types of sensors into an image that is more robust or richer in information, so as to facilitate subsequent processing or support decision-making. Infrared and visible image fusion has advantages in many respects.

First, the two signals come from different modalities and therefore provide different aspects of scene information: the visible image captures reflected light, while the infrared image captures thermal radiation [1]. This combination is thus more informative than a single-modality signal. Second, infrared and visible images capture the inherent characteristics of almost all objects and can be obtained with relatively simple equipment [2]. Finally, infrared and visible images have complementary characteristics, which makes it possible to produce robust and informative fused images. Visible images usually have high spatial resolution with considerable detail and light–dark contrast, so they are in line with human visual perception. However, they are easily degraded by adverse conditions such as insufficient lighting, fog and other bad weather. Infrared images, which describe the thermal radiation of objects, resist these interferences but usually have low resolution and poor texture. Owing to the universality and complementarity of the source images, visible and infrared image fusion has a wider range of applications than other fusion technologies.

The fusion of visible and infrared images is of great significance for personnel detection, particularly for detecting people who have fallen into the water. If only visible light images are used, the person is immersed in a fast-flowing, murky river, only a small part of the body is exposed above the surface, and the person almost blends into the water, making them hard to distinguish both for the naked eye and for a camera [3]. Even excellent detection algorithms struggle to detect them accurately when lighting conditions are good, let alone at night or in heavy fog [4]. An infrared image can separate the person from the background well: because the human body is warmer than the river water, it appears brighter in the infrared image and therefore stands out [5]. However, owing to low resolution and the lack of texture features, an infrared image provides only rough contour information [6].

If the scene contains high-temperature objects with a shape similar to the drowning person, false alarms and missed detections easily occur, as shown in Fig. 29.1. If the visible and infrared images are fused, the result not only highlights the human body but also retains certain texture features, which greatly improves detection precision and recall.

Fig. 29.1 Missed detection of a drowning person

2 Image Acquisition and Registration

2.1 Image Acquisition

The study uses a dual-light camera that acquires visible and infrared images simultaneously: one side is an optical camera and the other is an infrared thermal imaging camera.

2.2 Image Registration

Because infrared and visible images are obtained by different sensors, they usually differ in size, perspective and field of view [7]; the dual-light camera described above also introduces a difference in viewing angle. Successful image fusion, however, requires strict geometric alignment of the source images, so the visible and infrared images must be registered before fusion. Registering an infrared image to a visible image is a multi-modal registration problem.

For the registration problem here, the feature-based registration method is used. The feature-based method first extracts two groups of salient structures, then determines the correct correspondence between them, and estimates the spatial transformation accordingly, which is then used to align a given image pair.

The first step of a feature-based method is to extract robust common features that represent the original images. Edge information is one of the most common choices in infrared and visible image registration, as shown in Fig. 29.2, because the size and orientation of edges are well preserved across the two modalities. The edge maps can be discretized into point sets. A popular strategy for the resulting point-matching problem has two steps: compute a set of putative correspondences, then remove outliers through geometric constraints. Feature descriptors are computed at the points and matches whose descriptors differ too much are discarded; random sample consensus (RANSAC) then removes false matches from the putative set by repeatedly resampling minimal subsets to estimate the given parametric model and keeping the hypothesis with the fewest outliers.
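To make this pipeline concrete, the following is a minimal sketch of feature-based registration with OpenCV. The choice of ORB descriptors, the Canny thresholds and a homography model are illustrative assumptions; the paper does not fix a specific detector or transformation model.

```python
# Minimal sketch: edge-based feature matching + RANSAC registration (assumed ORB detector).
import cv2
import numpy as np

def register_ir_to_visible(ir_gray, vis_gray):
    # Emphasize edge structure, which is shared across the two modalities.
    ir_edges = cv2.Canny(ir_gray, 50, 150)
    vis_edges = cv2.Canny(vis_gray, 50, 150)

    # Detect keypoints and compute descriptors on the edge maps.
    orb = cv2.ORB_create(nfeatures=2000)
    kp_ir, des_ir = orb.detectAndCompute(ir_edges, None)
    kp_vis, des_vis = orb.detectAndCompute(vis_edges, None)

    # Putative correspondences: keep matches with small descriptor distance.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_ir, des_vis), key=lambda m: m.distance)
    good = matches[:200]

    src = np.float32([kp_ir[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_vis[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC removes false matches and estimates the spatial transformation.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Warp the infrared image into the visible image's coordinate frame.
    h, w = vis_gray.shape
    return cv2.warpPerspective(ir_gray, H, (w, h))
```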

Fig. 29.2 Edge information features and image registration process

3 Image Fusion

3.1 Fusion Network

A deep learning architecture for infrared and visible image fusion is adopted. Unlike a traditional convolutional network, the encoding network combines convolutional layers, a fusion layer and a dense block in which the output of each layer is connected to every subsequent layer. With this architecture, more useful features can be obtained from the source images during encoding; an appropriate fusion strategy then fuses the features, and the decoder finally reconstructs the fused image.

As shown in Fig. 29.3, the deep features of the visible and infrared images are extracted before fusion: the first convolutional layer extracts rough features, and three further convolutional layers, the output of each cascaded into the input of all subsequent layers, form a dense block. This architecture has two advantages. First, the filter size and stride of the convolution operations are 3 × 3 and 1, respectively, so the input image can be of any size. Second, the dense block retains deep features as far as possible in the encoding network, which ensures that all salient features are available to the fusion strategy.
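A minimal PyTorch sketch of such an encoder is shown below: one rough 3 × 3 convolution followed by a dense block of three 3 × 3, stride-1 convolutions whose outputs are concatenated. The channel widths are illustrative assumptions, not values stated in the paper.

```python
# Sketch of the dense-block encoder described above (channel sizes assumed).
import torch
import torch.nn as nn

class DenseBlockEncoder(nn.Module):
    def __init__(self, in_channels=1, growth=16):
        super().__init__()
        # First layer extracts rough features (3x3 kernel, stride 1).
        self.conv0 = nn.Sequential(nn.Conv2d(in_channels, growth, 3, 1, 1), nn.ReLU(inplace=True))
        # Dense block: each layer sees the concatenation of all previous outputs.
        self.conv1 = nn.Sequential(nn.Conv2d(growth, growth, 3, 1, 1), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(2 * growth, growth, 3, 1, 1), nn.ReLU(inplace=True))
        self.conv3 = nn.Sequential(nn.Conv2d(3 * growth, growth, 3, 1, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        x0 = self.conv0(x)
        x1 = self.conv1(x0)
        x2 = self.conv2(torch.cat([x0, x1], dim=1))
        x3 = self.conv3(torch.cat([x0, x1, x2], dim=1))
        # Stride-1, 3x3 convolutions keep the spatial size, so any input size works.
        return torch.cat([x0, x1, x2, x3], dim=1)  # 4 * growth feature channels
```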

Fig. 29.3 Fusion network structure

As shown in Fig. 29.4, L1-norm and softmax operations are applied in the fusion layer. The fused features are then passed to several 3 × 3 convolutional layers, the output of each layer serving as the input of the next; this simple and effective architecture is used to reconstruct the final fused image.
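A hedged sketch of the fusion strategy in Fig. 29.4 follows: per-pixel L1-norm activity maps from the two encoder outputs are turned into softmax weights that blend the infrared and visible feature maps before decoding. The exact weighting scheme is an assumption consistent with the figure caption.

```python
# Sketch of an L1-norm + softmax fusion strategy for encoder feature maps.
import torch

def l1_softmax_fusion(feat_ir, feat_vis):
    # feat_*: (N, C, H, W) encoder outputs for the registered image pair.
    act_ir = feat_ir.abs().sum(dim=1, keepdim=True)   # L1 norm over channels
    act_vis = feat_vis.abs().sum(dim=1, keepdim=True)
    # Softmax across the two activity maps gives per-pixel blending weights.
    weights = torch.softmax(torch.cat([act_ir, act_vis], dim=1), dim=1)
    w_ir, w_vis = weights[:, 0:1], weights[:, 1:2]
    return w_ir * feat_ir + w_vis * feat_vis          # fused features for the decoder
```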

Fig. 29.4 L1-norm and softmax operations applied in the fusion layer

3.2 Loss Function

The loss function of the fusion network is a weighted combination of the pixel loss function Lp and the structural similarity loss function Lssim:

$$L_{p} = \left\| {O - I} \right\|^{2}$$
(29.1)
$$L_{ssim} = 1 - SSIM(O,\;I)$$
(29.2)
$$L_{fus} = \lambda L_{ssim} + L_{p}$$
(29.3)

where O and I represent the output image and the input image, respectively. Lp is the Euclidean distance between the output O and the input I. Lssim is the structural similarity loss; SSIM measures the structural similarity of two images and is composed of three components: correlation, luminance distortion and contrast distortion, whose product gives the evaluation of the fused image. Since the pixel loss and the SSIM loss differ by three orders of magnitude, λ is set to 1000 in the training phase.
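A minimal sketch of this loss is given below, assuming a third-party SSIM implementation (here pytorch_msssim, which is not named in the paper) and λ = 1000 as stated above.

```python
# Sketch of the fusion loss L_fus = lambda * L_ssim + L_p (Eqs. 29.1-29.3).
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed dependency

def fusion_loss(output, target, lam=1000.0):
    l_p = F.mse_loss(output, target)                      # squared-error pixel loss, Eq. (29.1)
    l_ssim = 1.0 - ssim(output, target, data_range=1.0)   # structural similarity loss, Eq. (29.2)
    return lam * l_ssim + l_p                             # weighted sum, Eq. (29.3)
```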

4 Detection of Personnel Falling into the Water

4.1 Detection Network

A convolutional neural network (CNN) is used to detect the drowning person. The central idea of the detection network is to divide the picture into S × S cells; if the center of an object falls into a cell, that cell is responsible for predicting the object. Each cell predicts several bounding boxes and a confidence score for each box, and prediction is then carried out cell by cell.
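The grid-responsibility idea can be illustrated with the small sketch below; the image size, grid size and coordinates are hypothetical examples, not values from the paper.

```python
# Illustrative sketch: the cell containing a ground-truth box centre is
# made "responsible" for predicting that object.
def responsible_cell(box_center_x, box_center_y, img_w, img_h, S):
    """Return (row, col) of the S x S grid cell that must predict this object."""
    col = min(int(box_center_x / img_w * S), S - 1)
    row = min(int(box_center_y / img_h * S), S - 1)
    return row, col

# Example: an object centred at (412, 300) in a 640 x 480 frame with S = 13
# falls into cell (8, 8), so that cell's boxes and confidence are matched
# against this ground truth during training.
print(responsible_cell(412, 300, 640, 480, 13))  # -> (8, 8)
```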

The backbone network adopts a modified Darknet-53, as shown in Fig. 29.5. This network has high classification accuracy, fast computation and relatively few layers. The fully connected layer is removed, so the network is fully convolutional, and it makes extensive use of residual skip connections. To avoid the negative effects of pooling, the pooling layers are abandoned and downsampling is instead realized through the stride of the convolutional layers; in this structure, convolutions with a stride of 2 perform the downsampling.
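The two building blocks mentioned above can be sketched as follows, assuming the standard Darknet-53 Conv–BN–LeakyReLU convention; the exact channel widths are illustrative.

```python
# Sketch of a Darknet-style residual block and stride-2 downsampling convolution.
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k, s):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class ResidualBlock(nn.Module):
    """Residual skip connection: 1x1 bottleneck followed by a 3x3 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn_leaky(channels, channels // 2, 1, 1),
            conv_bn_leaky(channels // 2, channels, 3, 1),
        )

    def forward(self, x):
        return x + self.body(x)

# Downsampling without pooling: a 3x3 convolution with stride 2 halves the
# spatial resolution while learning the reduction.
downsample = conv_bn_leaky(64, 128, 3, 2)
```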

Fig. 29.5 Darknet-53 backbone network

The network outputs feature maps at three different scales and, borrowing from FPN, uses this multi-scale structure to detect targets of different sizes: the finer the grid, the smaller the objects that can be detected.

Before model training, a dataset of fused images must first be built: visible and infrared images are captured with the dual-light camera, fused through the registration and fusion process described above, and the drowning persons are labeled to produce a dataset in the format required for training. A pretrained model is then selected and trained to obtain a model that can identify drowning persons in the fused visible–infrared images. Finally, the accuracy of the model is evaluated and the model is optimized in terms of both the dataset and the algorithm to achieve a better recognition effect.

4.2 Loss Function

The loss function of the detection model is divided into three parts: the bounding-box loss Lbox, the confidence loss Lobj and the classification loss Lcls:

$$L_{box} = \lambda_{coord} \sum\limits_{i = 0}^{S^{2}} \sum\limits_{j = 0}^{B} 1_{i,j}^{obj} \left( 2 - w_{i} \times h_{i} \right)\left[ \left( x_{i} - \hat{x}_{i} \right)^{2} + \left( y_{i} - \hat{y}_{i} \right)^{2} + \left( w_{i} - \hat{w}_{i} \right)^{2} + \left( h_{i} - \hat{h}_{i} \right)^{2} \right]$$
(29.4)
$$L_{cls} = \lambda_{class} \sum\limits_{i = 0}^{{S^{2} }} {\sum\limits_{j = 0}^{B} {1_{i,j}^{obj} \sum\limits_{c \in classes} {p_{i} (c)log\left( {\hat{p}_{i} (c)} \right)} } }$$
(29.5)
$$L_{obj} = \lambda_{nobj} \sum\limits_{i = 0}^{{S^{2} }} {\sum\limits_{j = 0}^{B} {1_{i,j}^{nobj} \left( {c_{i} - \hat{c}_{i} } \right)^{2} + \lambda_{obj} \sum\limits_{i = 0}^{{S^{2} }} {\sum\limits_{j = 0}^{B} {1_{i,j}^{obj} \left( {c_{i} - \hat{c}_{i} } \right)^{2} } } } }$$
(29.6)

The detection loss is the sum of the above three errors, and the overall loss additionally includes the fusion loss:

$$L_{dec} = L_{box} + L_{cls} + L_{obj}$$
(29.7)
$$L = L_{fus} + L_{dec}$$
(29.8)
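The following is a hedged sketch of the detection loss in Eqs. (29.4)–(29.7). The tensor layout (an S²·B axis with an object mask marking responsible boxes) and the weighting constants are assumptions; the classification term is written with the conventional negative sign of cross-entropy.

```python
# Sketch of the detection loss L_dec = L_box + L_cls + L_obj (Eqs. 29.4-29.7).
import torch

def detection_loss(pred_xywh, true_xywh, pred_conf, true_conf,
                   pred_cls, true_cls, obj_mask,
                   lam_coord=5.0, lam_obj=1.0, lam_nobj=0.5, lam_cls=1.0):
    noobj_mask = 1.0 - obj_mask

    # Eq. (29.4): coordinate error, weighted by (2 - w*h) to emphasize small boxes.
    scale = 2.0 - true_xywh[..., 2] * true_xywh[..., 3]
    l_box = lam_coord * (obj_mask * scale *
                         ((pred_xywh - true_xywh) ** 2).sum(dim=-1)).sum()

    # Eq. (29.5): classification error (cross-entropy over the classes).
    l_cls = lam_cls * (obj_mask *
                       (-(true_cls * torch.log(pred_cls + 1e-9)).sum(dim=-1))).sum()

    # Eq. (29.6): confidence error, split between object and no-object boxes.
    conf_err = (pred_conf - true_conf) ** 2
    l_obj = lam_nobj * (noobj_mask * conf_err).sum() + lam_obj * (obj_mask * conf_err).sum()

    return l_box + l_cls + l_obj  # Eq. (29.7)
```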

5 Detection Fusion Reverse Guidance

The purpose of ordinary visible–infrared image fusion is to make the fused image contain as much information from the two source images as possible, losing neither the contrast information of the infrared image nor the texture information of the visible image, or to make the fused image better match the human visual system. For this reason, the loss function of the initial fusion process is defined as the weighted sum of the pixel loss function and the structural similarity loss function.

The focus of this system, however, is to detect the person in the water accurately; the fused image is only an intermediate result. Whether in the fusion stage or in the detection stage, optimization should take accurate detection as the ultimate goal. To achieve this, the training of the fusion network is modified so that the loss function of the detection stage also guides the fusion, and the final detection result is thereby optimized already in the fusion stage.

As shown in Fig. 29.6, the person in the water is first labeled on the registered visible or infrared image. Since the images are registered and aligned and the target position is unchanged after fusion, the label can be copied onto the fused image as the ground truth. After the fused image passes through the detection network, the predicted bounding box, class and confidence are obtained, and the detection error Ldec is computed against the label. This loss function is used not only to evaluate and optimize the detection network but also to evaluate and optimize the fusion network; it is added to the loss function of the fusion network, giving the corrected objective:

$$L = L_{fus} + L_{dec}$$
(29.9)
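A minimal sketch of this reverse-guidance training step is shown below: the detection loss computed on the fused image is back-propagated through both the detection network and the fusion network, so both are optimized toward accurate detection. The network and loss-function names are illustrative assumptions.

```python
# Sketch of a joint training step implementing Eq. (29.9): L = L_fus + L_dec.
def joint_training_step(fusion_net, detect_net, optimizer,
                        ir_img, vis_img, targets, fusion_loss_fn, detection_loss_fn):
    optimizer.zero_grad()
    fused = fusion_net(ir_img, vis_img)                    # intermediate fused image
    l_fus = fusion_loss_fn(fused, ir_img, vis_img)         # Eqs. (29.1)-(29.3)
    l_dec = detection_loss_fn(detect_net(fused), targets)  # Eqs. (29.4)-(29.7)
    loss = l_fus + l_dec                                   # Eq. (29.9)
    loss.backward()   # gradients flow from the detector back into the fusion network
    optimizer.step()
    return loss.item()
```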
Fig. 29.6 Dual registered image fusion detection

Furthermore, a waterway operational monitoring system was developed on the basis of this study and demonstrated on China's inland waterways, as shown in Figs. 29.7, 29.8, 29.9 and 29.10. Tests show that the recognition accuracy of the system at night exceeds 95%.

Fig. 29.7 Dynamic identification effect of crew in daytime

Fig. 29.8 Identification effect of crew at night

Fig. 29.9 Detection results of people falling into the water in thermal imaging mode (depression angle 9°)

Fig. 29.10 Detection results of people falling into the water in thermal imaging mode (depression angle 8°)

6 Conclusion

In this study, a drowning-person detection method based on the fusion of visible light and thermal imaging data is proposed. The method includes image acquisition, image registration, image fusion and target detection. It forms a complete, feasible and practical system that can be used to detect drowning persons for subsequent positioning and rescue.