1 Introduction

The autonomous marine vehicle (AMV) [1] has developed rapidly in recent years to meet a wide range of civilian, commercial and military maritime mission requirements, including ocean surveying, environment monitoring, anti-submarine warfare, weapons delivery and electronic warfare support. Although the aim of the AMV is unmanned, automatic operation, it is typically semi-autonomous due to the complexity of the marine environment. The latest advanced techniques provide greater support for AMVs than ever before [2]. Fixed and moving targets can be recognized by the vessel's navigation equipment and systems, such as cameras, the Automatic Radar Plotting Aid (ARPA) and the Automatic Identification System (AIS). ARPA provides accurate information about nearby obstacles, including range and bearing [3]. AIS is a dedicated system that broadcasts a vessel's information, including structural data, position, course and speed. Researchers have made use of these techniques to make the AMV less reliant on operator interaction and to move toward a fully automatic AMV. The main challenge in the automatic navigation, guidance and control (NGC) of an AMV is obstacle recognition and the selection of appropriate collision avoidance maneuvers that minimize dependence on operator intervention. Many advanced theories and algorithms have been adopted. Fuzzy theory is an important artificial intelligence technique that imitates human reasoning and has been used for AMV NGC [4,5,6]. Using fuzzy decision-making systems and neuro-fuzzy inference, human-like behavior imitation systems have been developed for reasoning about collision avoidance criteria [7,8,9]. The collision avoidance problem has also been treated as an optimization problem.
Therefore, many meta-heuristic optimization algorithms have been adopted for collision avoidance planning [10,11,12,13,14,15], such as evolutionary algorithms (EA) [16,17,18,19,20,21,22], ant colony optimization (ACO) [23], particle swarm optimization (PSO) and artificial immune algorithms [24]. When collision avoidance methods are built on such optimization algorithms, the route is generated from key waypoints. However, in busy channels, harbors and similar settings, the dynamic environment, in particular moving obstacles, can change significantly and substantially increase the collision risk [25]. It may then be impossible to set the waypoints in advance or to treat them as fixed. A vision system is therefore essential for a vessel when the environment is unknown or the obstacles are dynamic [26, 27]. In a narrow channel or a busy harbor, collision avoidance is normally based on the crew's vision or on human interaction. However, full automation of AMV maneuvers based on vision, emulating crew-driven collision avoidance with machine learning technology, has developed more slowly [28, 29].

In recent years, deep learning techniques have made tremendous progress in machine vision. Deep learning methods have become increasingly rich, spanning unsupervised and supervised algorithms such as deep restricted Boltzmann machines, deep convolutional networks, deep recurrent neural networks and deep generative models [30]. Deep neural networks have produced important results in image classification, target detection and image understanding.

In a conventional neural network, the input is usually a low-dimensional vector. For a convolutional neural network (CNN), however, it can be a high-dimensional input such as a high-resolution image. The CNN is a special neural network designed to exploit the 2D structure of an input image: it makes use of neighborhood characteristics and weight sharing to reduce the number of parameters. As a result, a CNN is easier to train than a fully connected network with the same number of hidden units. LeNet, named after Yann LeCun, was one of the first successful convolutional neural networks [31]; this pioneering work was mainly used to recognize zip codes, handwritten digits and similar patterns. Although depth is an inherent property of these models, the application of deep CNNs was long limited, mainly by restricted computing power and a shortage of data. With the growth of computing power and the rise of big data, CNNs could tackle increasingly interesting problems. In 2006, the deep architecture was proposed by Hinton [32], and deep neural networks became popular. Deep CNNs have since been widely applied in image and speech recognition. Alex Krizhevsky et al. proposed the famous Alexnet, an extension of LeNet, which won the ImageNet Large Scale Visual Recognition Challenge by a substantial margin in 2012 [33]. Compared with previous approaches, it offered a great advantage, and many subsequent models are modifications of Alexnet applied to a variety of fields.

In this paper, Alexnet is adopted to learn the automatic maneuvering characteristics of the AMV and to realize automatic maneuvering based on the vision system. The network contains five convolutional layers and two fully connected layers. The down-sampling layers adopt the max-pooling operation, and normalization layers are added to regulate the distribution of the internal activations. The optical images are taken from a deck-side camera. After the Alexnet is trained on sample data, the AMV is able to steer and navigate autonomously and efficiently in an unknown environment; that is, it learns the steering characteristics from the sample data.

This paper is organized into five sections. Section 2 presents the structure of the convolutional neural network and the training rule. Section 3 presents the AMV collision avoidance technique. Section 4 exhibits the convolutional neural network training process. Section 5 concludes the paper.

2 Convolutional Neural Network

The convolutional neural network (CNN) is an artificial neural network that takes advantage of the 2D structure of images. A CNN exploits the neighborhood characteristics of an image and uses weight sharing to reduce the number of parameters [34]. The convolutional kernel is shared across the whole image as shown in Fig. 1; therefore, the parameters of a CNN are the kernel weights, unlike a fully connected network whose parameters scale with the pixels of the image. A CNN usually consists of several layer types, such as convolutional, pooling and fully connected layers, as in Fig. 1.

Fig. 1

CNN general structure

Alexnet consists of five kinds of layers: convolutional layers, pooling layers, normalization layers, dropout layers and fully connected layers. The convolutional layer convolves the input image with a convolution kernel and outputs a smaller feature map after an activation computation. The pooling layer reduces the dimension of the feature maps in one of several ways (max pooling, average pooling, etc.). A batch normalization layer is added to adjust the distribution of the data, and a dropout strategy follows the batch normalization to counter over-fitting. Finally, the fully connected layers are applied. In this section, we describe these components in detail.

(1) Convolutional layer

In the convolutional layer, the input image is convolved with the convolution kernel as in Eq. (1).

$$\begin{aligned} ac_{i,j}^{l} & = f\left( {\sum\nolimits_{m = 0}^{M} {\sum\nolimits_{n = 0}^{N} {w_{m,n}^{l} x_{i + m,j + n}^{l - 1} + b^{l} } } } \right) \\ & = f\left( {conv(w_{m,n}^{l} ,x_{i + m,j + n}^{l - 1} ) + b^{l} } \right) \\ \end{aligned}$$
(1)

l is the index of the convolution layer. In a deep architecture, the input of the current layer is the output of the previous layer. M and N give the kernel size, and usually M = N. \(x_{i,j}^{l}\) is the input pixel at column i and row j, \(w_{m,n}^{l}\) is the weight of the convolution kernel at column m and row n, and \(b^{l}\) is the bias. The convolution kernel slides over the image, producing one feature map per kernel; with several kernels, there are as many feature maps as kernels. \(ac_{i,j}^{l}\) is the pixel value of the feature map at column i and row j, conv() is the convolution function, and f() is the activation function, such as sigmoid() or Relu(). The sigmoid() function is a classic activation function for artificial neural networks: it compresses real inputs into the range 0–1, and its derivatives are smooth. However, for this very reason, gradient dispersion occurs during error back propagation, as illustrated in Fig. 2. The Relu() function is a very simple piecewise linear model used in the forward computation [35, 36]. Its partial derivative is simple and does not compress the signal, so it is less prone to gradient diffusion; it is therefore usually adopted in deep neural networks.
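As an illustration (not the implementation used in this work), the convolution of Eq. (1) with a Relu activation can be sketched in NumPy; the image and kernel values below are hypothetical:

```python
import numpy as np

def relu(x):
    # Piecewise linear activation: max(0, x)
    return np.maximum(0.0, x)

def conv2d(image, kernel, bias=0.0):
    """Valid 2D convolution of Eq. (1): slide the shared kernel over the
    image and apply the activation to each windowed weighted sum."""
    M, N = kernel.shape
    H, W = image.shape
    out = np.zeros((H - M + 1, W - N + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + M, j:j + N]
            out[i, j] = np.sum(window * kernel) + bias
    return relu(out)

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[0.0, -1.0], [1.0, 0.0]])        # hypothetical 2x2 kernel
fmap = conv2d(image, kernel)
print(fmap.shape)  # (3, 3) feature map
```

With several kernels, this loop simply runs once per kernel, yielding one feature map each.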

Fig. 2

Graphs of the Relu and sigmoid functions

The convolutional operation of a CNN is based on a 2D structure; Fig. 3 illustrates this operation.

Fig. 3

Convolutional operation process of CNN

(2) Pooling layer

Another important concept in CNNs is the pooling operation, a form of nonlinear down-sampling. By discarding the smaller values, it reduces the computation required by the following layers. There are several pooling methods; max pooling is the preferred one here. Max pooling outputs the maximal value of each non-overlapping region of the input, so the feature map is recreated from the maximum values at a smaller size. Max pooling is robust to small positional shifts and is an effective way to reduce the dimensionality of a feature map.

Each pixel of the pooled feature map is obtained by the max-pooling function of Eq. (2).

$${\text{ap}}_{i,j}^{l} = \mathop {\hbox{max} }\limits_{i,j \in \varOmega } \{ {\text{ac}}_{i,j}^{l} \}$$
(2)

Ω is the set of n*n non-overlapping sub-regions of the image, and max() is the maximal function. In common practice, 2*2 down-sampling is used, so the feature map is reduced to half scale in each dimension.
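A minimal NumPy sketch of the 2*2 max pooling of Eq. (2) follows (illustrative only; the feature-map values are made up):

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max pooling of Eq. (2): each output pixel is the
    maximum of one size x size block, halving each spatial dimension."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            block = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = block.max()
    return out

fmap = np.array([[1., 2., 5., 0.],
                 [3., 4., 1., 2.],
                 [0., 1., 7., 8.],
                 [2., 1., 3., 4.]])
print(max_pool(fmap))  # [[4. 5.] [2. 8.]]
```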

(3) Normalization layer

The Relu() function has several merits, but its output is non-negative. The output of the pooling layer is therefore unbalanced, which is not conducive to network training. Hence a Gaussian normalization, called batch normalization, is used to adjust the data distribution [37].

$$\mu \leftarrow \frac{1}{m*n}\sum\nolimits_{i = 1}^{m} {\sum\nolimits_{j = 1}^{n} {ap_{i,j}^{l} } }$$
(3)
$$\sigma_{\beta }^{2} \leftarrow \frac{1}{m*n}\sum\nolimits_{i = 1}^{m} {\sum\nolimits_{j = 1}^{n} {\left( {ap_{i,j}^{l} - \mu } \right)^{2} } } \,$$
(4)
$$\hat{a}_{i,j}^{l} \leftarrow \frac{{ap_{i,j}^{l} - \mu }}{{\sqrt {\sigma_{\beta }^{2} + \varepsilon } }}$$
(5)
$$x_{i,j}^{l} \leftarrow \gamma \hat{a}_{i,j}^{l} + \beta$$
(6)

μ and \(\sigma_{\beta }^{2}\) are the mean and variance of the input, γ is the scale parameter, and β is the shift parameter. m and n are the height and width of the input feature map, so m * n is the number of samples in the batch operation.
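Eqs. (3)-(6) can be sketched as follows (an illustrative NumPy version; γ and β are left at their identity values 1 and 0, and the input values are made up):

```python
import numpy as np

def batch_norm(ap, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalization of Eqs. (3)-(6): subtract the mean, divide by the
    standard deviation, then scale and shift with learnable gamma, beta."""
    mu = ap.mean()            # Eq. (3)
    var = ap.var()            # Eq. (4)
    a_hat = (ap - mu) / np.sqrt(var + eps)  # Eq. (5)
    return gamma * a_hat + beta             # Eq. (6)

ap = np.array([[1., 2.], [3., 4.]])
x = batch_norm(ap)
print(round(x.mean(), 6), round(x.std(), 4))  # normalized to ~0 mean, ~1 std
```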

(4) Dropout layer

Over-fitting is always an acute problem for artificial neural networks, especially deep ones, because of their many parameters; with limited samples, over-fitting is all the more likely. Dropout is an effective strategy against this problem. In a dropout operation, the network is modified by ignoring some output neurons rather than by changing the cost function [38]. Because a different set of hidden neurons is dropped in each pass, it is as if many different neural networks were being trained; these networks over-fit in different ways, so the ensemble reduces the probability of over-fitting.
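An illustrative sketch of dropout follows; it uses the "inverted" variant, which rescales the surviving activations by 1/(1-p) so that no change is needed at test time (the paper does not specify this detail, so it is an assumption):

```python
import numpy as np

def dropout(a, p=0.5, rng=None, train=True):
    """During training, zero each activation with probability p and
    rescale the survivors by 1/(1-p) ("inverted dropout"), so the
    expected activation is unchanged; at test time, pass through."""
    if not train:
        return a
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(a.shape) >= p   # keep each unit with prob. 1-p
    return a * mask / (1.0 - p)

a = np.ones((4, 4))
out = dropout(a, p=0.5)
print(out)  # roughly half the entries zeroed, the rest scaled to 2.0
```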

(5) Fully connected layer and output layer

The fully connected part of Alexnet operates on the flattened feature maps and is essentially a multilayer perceptron (MLP). The output layer adopts a softmax activation function (other classifiers such as SVM or BP networks could also be considered). The input of the fully connected layer represents the high-level features of the image; the role of the fully connected layer is to recombine these features to facilitate classification. The final feature maps are flattened into a vector \(af^{{FC_{1} }}\) as the input of the fully connected layer, and the calculation then proceeds as in an MLP:

$$\begin{aligned} af_{j}^{{FC_{k} }} & = relu\left( {net_{j} } \right) \\ \, & = relu\left( {\sum\nolimits_{i = 1}^{I} {w_{i,j}^{{FC_{k} }} *af_{i}^{{FC_{k - 1} }} } + b_{j}^{{FC_{k} }} } \right) \\ \end{aligned}$$
(7)

\(af_{j}^{{FC_{k} }}\) is the output of the k-th fully connected layer, \(b_{j}^{{FC_{k} }}\) is its bias, \(w_{i,j}^{{FC_{k} }}\) is the weight between the i-th input neuron and the j-th output neuron, and I is the number of input neurons.
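Eq. (7) can be sketched as follows (illustrative NumPy code; the weight, bias and input values are hypothetical):

```python
import numpy as np

def fc_forward(af_prev, W, b):
    """Fully connected layer of Eq. (7): weighted sum plus bias,
    followed by the Relu activation."""
    return np.maximum(0.0, W @ af_prev + b)

af_prev = np.array([1.0, -2.0, 0.5])      # flattened feature vector
W = np.array([[0.2, 0.1, 0.4],            # hypothetical weights
              [-0.3, 0.2, 0.1]])
b = np.array([0.1, 0.0])
print(fc_forward(af_prev, W, b))  # ~[0.3, 0.]: second unit clipped by Relu
```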

The output layer is the classification layer; a softmax function is adopted to realize multi-category classification.

$$o_{j} = softmax\left( {h\left( {af_{j}^{{FC_{L} }} } \right)} \right)$$
(8)
$$h(x_{j} ) = \frac{1}{{1 + e^{{ - \theta^{T} x}} }}$$
(9)

\(o_{j}\) is the j-th neuron output, and \(af^{{FC_{L} }}\) is the output of the last fully connected layer.
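The softmax output of Eq. (8) can be sketched as follows (an illustrative implementation; the max-shift is a standard numerical-stability trick not stated in the text, and the logits are made up):

```python
import numpy as np

def softmax(z):
    """Softmax of Eq. (8): exponentiate (shifted for numerical stability)
    and normalize so the outputs form a probability vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # one score per maneuver class
probs = softmax(logits)
print(probs.sum())  # the outputs sum to 1
```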

(6) The structure of Alexnet

The input of Alexnet is a color image with RGB channels. The network contains five convolutional layers, five max-pooling layers, five batch normalization layers, five dropout layers and two fully connected layers. Figure 4 shows the structure of Alexnet.

Fig. 4

Structure of Alexnet

(7) Training of Alexnet

The Alexnet is trained with the back propagation algorithm, whose main steps are as follows:

Step 1 Calculate the output \(a_{j}\) of each output neuron j of the output layer in the forward pass from the input image data set.

Step 2 From the output of the output layer, compute the classification error with the loss function \(E_{d}\).

Step 3 Calculate the backward error \(\delta_{j}\) from the partial derivative of \(E_{d}\) with respect to \(net_{j}\).

$$\delta_{j} = \frac{{\partial E_{d} }}{{\partial net_{j} }}$$
(10)

\(net_{j}\) is the input of the j-th neuron; this is a generic name, and it takes a different form in each layer type.

Step 4 Gradient computation: the partial derivative of the loss function \(E_{d}\) with respect to the weight \(w_{ij}\) (the connection between neuron i and neuron j) is computed by the following equation.

$$\frac{{\partial E_{d} }}{{\partial w_{ij} }} = a_{i} \delta_{j}.$$
(11)
(1) Output layer weight tuning

In the output layer, the parameters are tuned according to the loss function. The cross-entropy loss is usually adopted with softmax classification, as in Eq. (12).

$$\begin{aligned} E_{d} (\theta ) & = - \frac{1}{m}\left[ {\sum\limits_{i = 1}^{m} {t_{i} \log h\left( {ac_{i}^{{FC_{L} }} } \right)} \, + } \right. \\ & \quad \left. {(1 - t_{i} )\log \left( {1 - h\left( {ac_{i}^{{FC_{L} }} } \right)} \right)} \right] \\ \end{aligned}$$
(12)

m is the sample number, \(t_{i}\) is the teacher signal, and \(ac^{{FC_{L} }}\) is the output of the last fully connected layer. For k categories, the loss takes the softmax form:

$$E_{d} (\theta ) = - \frac{1}{m}\left[ {\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{k} {1\{ t_{i} = j\} \log \frac{{e^{{\theta_{j}^{T} ac_{i}^{{FC_{L} }} }} }}{{\sum\nolimits_{l = 1}^{k} {e^{{\theta_{l}^{T} ac_{i}^{{FC_{L} }} }} } }}} } } \right]$$
(13)

k is the number of categories. 1{·} is an indicative function. 1{True} = 1, and 1{False} = 0.

$$\frac{{\partial E_{d} }}{{\partial \theta_{j} }} = - \frac{1}{m}\sum\limits_{i = 1}^{m} {\left[ {ac_{i}^{{FC_{L} }} \left( {1\{ t_{i} = j\} - p\left( {t_{i} = j|ac_{i}^{{FC_{L} }} ;\theta } \right)} \right)} \right]}$$
(14)
$$p\left( {t_{i} = j|ac_{i}^{{FC_{L} }} ;\theta } \right) = \frac{{e^{{\theta_{j}^{T} ac_{i}^{{FC_{L} }} }} }}{{\sum\nolimits_{l = 1}^{k} {e^{{\theta_{l}^{T} ac_{i}^{{FC_{L} }} }} } }}$$
(15)

p() is the probability that \(ac_{i}^{{FC_{L} }}\) belongs to category j.
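Eqs. (13)-(15) and the resulting gradient can be sketched for a single sample (m = 1) as follows; the feature vector and parameters θ are hypothetical, and the sign convention is that of the negative log-likelihood:

```python
import numpy as np

def softmax_xent_grad(ac, t, theta):
    """Softmax cross-entropy for one sample: p_j of Eq. (15) is
    exp(theta_j^T ac) / sum_l exp(theta_l^T ac); the gradient of the
    loss w.r.t. theta_j reduces to ac * (p_j - 1{t = j})."""
    z = theta @ ac                 # one logit per category
    p = np.exp(z - z.max())
    p /= p.sum()                   # Eq. (15)
    loss = -np.log(p[t])           # Eq. (13) with m = 1
    onehot = np.zeros_like(p)
    onehot[t] = 1.0
    grad = np.outer(p - onehot, ac)  # d loss / d theta, row per category
    return loss, grad

ac = np.array([1.0, 0.5])                    # last FC-layer output
theta = np.array([[0.2, -0.1], [0.0, 0.3]])  # hypothetical parameters
loss, grad = softmax_xent_grad(ac, t=0, theta=theta)
print(loss > 0, grad.shape)
```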

Setting \(\delta_{j} = - \frac{{\partial E_{d} }}{{\partial \theta_{j} }}\), the weight adjustment follows the stochastic gradient rule, as in Eq. (16),

$$\begin{aligned} w_{ij} = w_{ij} - \eta \frac{{\partial E_{d} }}{{\partial w_{ij} }} \hfill \\ \, = w_{ij} - \eta \delta_{j} x_{ij}. \hfill \\ \end{aligned}$$
(16)
(2) Fully connected layer weight tuning

The gradient computation for a hidden layer differs from that for the output layer. The fully connected layer is a typical MLP, so the weight adjustment of each neuron is driven by the sum of the backward errors of the neurons it connects to. Suppose \(net_{k}\) is the input of a neuron k in the next layer fed by neuron j. Then \(E_{d}\) is a function of \(net_{k}\), and \(net_{k}\) is a function of \(net_{j}\). By the chain rule [39] and Eq. (7), the partial derivative of the loss function \(E_{d}\) with respect to \(net_{j}\) is as follows.

$$\begin{aligned} \frac{{\partial E_{d} }}{{\partial net_{j} }} & = \sum\limits_{k = 1}^{n} {\frac{{\partial E_{d} }}{{\partial net_{k} }}\frac{{\partial net_{k} }}{{\partial net_{j} }}} \\ & = \sum\limits_{k = 1}^{n} { - \delta_{k} \frac{{\partial net_{k} }}{{\partial af_{j}^{FC} }}\frac{{\partial af_{j}^{FC} }}{{\partial net_{j} }}} \\ & = \sum\limits_{k = 1}^{n} { - \delta_{k} \omega_{kj} \frac{{\partial relu\left( {net_{j} } \right)}}{{\partial net_{j} }}} \\ & = \sum\limits_{k = 1}^{n} { - \delta_{k} \omega_{kj} } \\ \end{aligned}$$
(17)

Define \(\delta_{j} = - \frac{{\partial E_{d} }}{{\partial net_{j} }}\); therefore, \(\delta_{j} = \sum\limits_{k = 1}^{n} {\delta_{k} \omega_{kj} }\).
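The hidden-layer error of Eq. (17) can be sketched as follows (illustrative values; the Relu-derivative mask is written out explicitly, whereas Eq. (17) drops it because it equals 1 in the active region):

```python
import numpy as np

def backprop_delta(delta_next, W_next, net):
    """Eq. (17): each hidden neuron's delta is the weighted sum of the
    next layer's deltas, times the Relu derivative (1 where net > 0)."""
    return (W_next.T @ delta_next) * (net > 0)

delta_next = np.array([0.2, -0.1])        # deltas of layer k
W_next = np.array([[0.5, -1.0, 0.0],       # weights omega_kj
                   [0.3, 0.2, 0.4]])
net = np.array([1.0, -2.0, 0.5])           # pre-activations of layer j
print(backprop_delta(delta_next, W_next, net))  # middle neuron is inactive
```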

(3) Convolutional Layer Training

According to the convolution operation, the convolutional layer can be described as follows,

$$\begin{aligned} net^{l} & = conv\left( {w^{l} ,ac^{l - 1} } \right) + \omega_{b} \\ ac_{i,j}^{l - 1} & = relu\left( {net_{i,j}^{l - 1} } \right) \\ \end{aligned}$$
(18)

conv() is the convolution operation, \(w^{l}\) is the convolutional kernel and \(\omega_{b}\) is the bias, \(ac^{l - 1}\) is the input of layer l (the output of the previous layer l-1), and \(net_{i,j}^{l - 1}\) is the convolution output of layer l-1 at column i and row j.

According to the chain rule, the residual error can be computed as follows [31],

$$\delta_{i,j}^{l - 1} \,=\, \frac{{\partial E_{d} }}{{\partial net_{i,j}^{l - 1} }}{\,=\, }\frac{{\partial E_{d} }}{{\partial ac_{i,j}^{l - 1} }}\frac{{\partial ac_{i,j}^{l - 1} }}{{\partial net_{i,j}^{l - 1} }}$$
(19)

\(\delta_{i,j}^{l - 1}\) is the error item of l-1 layer at column i and row j.

Firstly, consider the second factor. Because \(ac_{i,j}^{l - 1} = relu\left( {net_{i,j}^{l - 1} } \right)\), and the derivative of relu is 1 for positive inputs (and 0 otherwise), in the active region

$$\frac{{\partial ac_{i,j}^{l - 1} }}{{\partial net_{i,j}^{l - 1} }} = relu^{\prime}\left( {net_{i,j}^{l - 1} } \right) = 1$$
(20)

According to the convolution computing,

$$\frac{{\partial E_{d} }}{{\partial a^{l - 1} }} = \delta^{l} * w^{l}$$
(21)

This formula can be expanded element-wise:

$$\frac{{\partial E_{d} }}{{\partial a_{i,j}^{l - 1} }} = \sum\limits_{m} {\sum\limits_{n} {w_{m,n}^{l} } } \delta_{i + m,j + n}^{l}$$
(22)

\(w_{m,n}\) is the weight of one filter at column m and row n.

Then the final result is as follows

$$\begin{aligned} \delta_{i,j}^{l - 1} & = \frac{{\partial E_{d} }}{{\partial net_{i,j}^{l - 1} }} \\ & { = }\frac{{\partial E_{d} }}{{\partial ac_{i,j}^{l - 1} }}\frac{{\partial ac_{i,j}^{l - 1} }}{{\partial net_{i,j}^{l - 1} }} \\ & { = }\sum\limits_{m} {\sum\limits_{n} {\omega_{m,n}^{l} \delta_{i + m,j + n}^{l} } } f^{\prime}\left( {net_{i,j}^{l - 1} } \right) \\ & { = }\sum\limits_{m} {\sum\limits_{n} {\omega_{m,n}^{l} \delta_{i + m,j + n}^{l} } } \\ & { = }\,\delta^{l} * W^{l} \\ \end{aligned}$$
(23)

If the number of filters is N, the error term is a summation over the filters.

$$\delta^{l - 1} = \sum\limits_{d = 1}^{N} {\delta_{d}^{l} * W_{d}^{l} }$$
(24)

d is the filter index.

For the training of the bias term, the error comes directly from the deeper layer,

$$\frac{{\partial E_{d} }}{{\partial b^{l - 1} }} = \sum\limits_{i} {\sum\limits_{j} {\delta_{i,j}^{l} } }$$
(25)

For the max-pooling layer, the error term is transferred to the maximum-value neuron of the corresponding block in the upper layer, and the error of all other neurons is zero.
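This error routing for the max-pooling layer can be sketched as follows (illustrative NumPy code with made-up values):

```python
import numpy as np

def max_pool_backward(fmap, delta_pooled, size=2):
    """Route each pooled-layer error back to the max-value neuron of the
    corresponding block of the upper layer; all other neurons get zero."""
    grad = np.zeros_like(fmap)
    for i in range(delta_pooled.shape[0]):
        for j in range(delta_pooled.shape[1]):
            block = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size]
            r, c = np.unravel_index(block.argmax(), block.shape)
            grad[i * size + r, j * size + c] = delta_pooled[i, j]
    return grad

fmap = np.array([[1., 2.], [3., 0.]])   # forward-pass block, max was 3
delta = np.array([[5.]])                # error of the pooled output
print(max_pool_backward(fmap, delta))   # only the position of 3 gets 5
```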

3 Collision Avoidance Rule

The deep CNN is used to learn the maneuvering ability of the crew. When maneuvering a vessel, the crew must observe the COLREGs, the standard for collision avoidance operations in navigation at sea [1, 40]. The design of the AMV NGC system should likewise respect the COLREGs. Therefore, the COLREGs must be considered when selecting or generating CNN samples, because the regulations are implicit and are reflected only in the operation strategy. The CNN can then learn the human maneuvering experience and realize automatic AMV navigation. We now give a brief description of the COLREGs.

The COLREGs provide a safety guideline for maritime navigation and maneuvering at sea, designed for navigators who operate vessels based on their experience [41]. According to the COLREGs, there are three encounter situations: head-on, crossing and overtaking, as shown in Fig. 5. A suitable collision avoidance operation should be adopted when an encounter occurs in good visibility. The vessel that maintains course and speed is called the stand-on vessel, while the give-way vessel is responsible for the avoidance maneuver according to the COLREGs (see Fig. 6).

Fig. 5

Encounter situation

Fig. 6

Give-way vessel yields to the stand-on vessel

When two vessels meet, both have the opportunity to take an appropriate strategy to avoid collision. When a collision risk exists, the give-way ship should take appropriate action to keep a safe passing distance according to the regulations, while the stand-on ship should maintain its course and speed. However, if the give-way ship does not act to keep a safe passing distance according to the COLREGs, the stand-on ship should adopt a suitable strategy to avoid collision. Figure 7 shows the collision avoidance operations in the different encounter situations. In a head-on situation, both vessels have a duty to avoid collision by turning right, as shown in Fig. 7a. In an overtaking situation, the overtaking ship should turn right to pass the stand-on vessel, as shown in Fig. 7b. In a crossing situation, the collision avoidance operation depends on the orientation of the crossing: in a right crossing, vessel 1 should turn right and vessel 2 is the stand-on vessel; in a left crossing, vessel 2 should turn right. Figure 7e shows a parallel crossing situation.

Fig. 7

Operation of different encounter situations. a Head-on situation, b overtaking situation, c crossing situation 1, d crossing situation 2, e collision avoidance for parallel crossing

The collision avoidance operations taken by the own ship depend on the target vessel's behavior as well as on the regulations. These operations fall into two categories: changing the ship's course and changing its speed. Course changing is preferred in traditional navigation because of the difficulty and delay of controlling the engines from the bridge; speed changing is reserved for critical situations where a course change alone cannot avoid collision. Therefore, the Alexnet is used to learn the crew's course-change characteristics from optical vision information.

4 Simulations

To demonstrate the effectiveness of the method, several simulation studies were carried out. In a conventional collision avoidance simulation, the kinetic model of the AMV and motion analysis theory give the position of the AMV, and the collision avoidance system can be tested with this information. For the proposed method, however, we need the vision data of the AMV as input. At present, AMV operation still relies on human interaction; even where cameras are installed on AMVs, the images and operations are not labeled, and AMVs are not as widespread as cars. Therefore, we use the game European Ship Simulator to capture vision data that reflect the maneuvering characteristics, recorded with the Fraps software. Fraps runs alongside European Ship Simulator, and the forward camera view of the AMV is recorded manually by pressing the Fraps function key. Figures 8, 9, 10 and 11 show snapshots of the recorded data. The images are then cropped to 448 * 224 pixels to generate the sample data for the deep neural network. The network is trained on a workstation (DELL T7910) with a Tesla K40 GPU accelerator.

Fig. 8

Some collision avoidance situations. a Initial head-on situation, b head-on situation after collision avoidance, c initial crossing situation, d crossing situation after collision avoidance, e initial overtaking situation, f overtaking situation after collision avoidance, g navigation in channel initial situation, h after collision avoidance in channel

Fig. 9

Digital images preprocessing for recording data. a Original image, b the original image with noise, c translation of original image, d the translation image with noise

Fig. 10

Training process after 300 iterations

Fig. 11

Training process starting from 2000 iterations

Several standard encounter situations are created here in order to simulate the whole collision avoidance process and capture the actions and vision data.

Firstly, a standard head-on scenario is simulated in European Ship Simulator and samples are recorded for Alexnet training. The AMV is steered toward the target vessel as in Fig. 8a, resulting in a head-on situation. When a potential collision is detected, both the own vessel and the target vessel must execute a starboard maneuver so as to pass on each other's port side according to the COLREGs; the AMV takes a starboard maneuver as in Fig. 8b. Figure 8c, d shows the crossing encounter situation. As shown in Fig. 8c, a vessel is crossing to the right side of the AMV, so there is a collision risk unless an appropriate measure is taken. The AMV maneuvers to starboard to avoid a collision as in Fig. 8d, while the target vessel maintains its course and speed according to the COLREGs.

As shown in Fig. 8e, f, the AMV is overtaking the target vessel. According to the COLREGs, the AMV bears full responsibility for avoiding collision. When a collision risk arises, the AMV takes a starboard maneuver in accordance with the COLREGs, as illustrated in Fig. 8f, and overtakes the target vessel successfully without causing a collision.

Figure 8g, h shows vision-based obstacle detection; the AMV navigates the channel successfully. Note that wind and waves are ignored in the scene.

To augment the data, we apply digital image processing techniques, such as translation, small rotations and additive noise, to increase the number of samples, as in Fig. 9.
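A minimal sketch of such augmentation follows (illustrative only: np.roll stands in for the translation, the noise level is an arbitrary choice, and the image is a made-up stand-in):

```python
import numpy as np

def augment(image, shift=2, noise_std=5.0, seed=0):
    """Enlarge the sample set with two simple transforms: a horizontal
    translation and additive Gaussian noise, clipped back to the valid
    pixel range."""
    rng = np.random.default_rng(seed)
    shifted = np.roll(image, shift, axis=1)
    noisy = np.clip(image + rng.normal(0.0, noise_std, image.shape), 0, 255)
    return shifted, noisy

image = np.full((4, 6), 128.0)   # stand-in grayscale frame
shifted, noisy = augment(image)
print(shifted.shape, noisy.shape)  # both variants keep the original size
```

Each transform yields an extra labeled sample per original frame, so the sample count grows by a factor equal to the number of transforms applied.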

Although there are many distractors in the samples, such as the ship's mast, the window frame and parts of the ship, the Alexnet can learn the driving ability from the samples. Because the loss values differ greatly over the course of training, the training process is shown in two figures: Fig. 10 shows the training process after 300 iterations, and Fig. 11 shows the process between 2000 and 10,000 iterations. At the beginning of training, the loss is very large, up to 2.5 * 10^7. After hundreds of iterations, the loss comes down to about 300, and it then falls to a small value as the iterations continue.

After training, the Alexnet has learned the human maneuvering characteristics for handling the typical encounter situations. After each convolution, the feature maps reflect the learned information. Figure 12a shows the input image, and Fig. 12b shows the feature maps after the first convolution; a variety of feature maps describe the input image. The convolution layer is followed by a max-pooling layer, which reduces the size of the input, and a normalization layer, which adjusts the distribution of the data fed to the next convolution layer at half the input size, as in Fig. 12c. The subsequent feature maps become smaller and smaller, as shown in Fig. 12d, and may be regarded as feature components of the input.

Fig. 12

Convolution layer output of Alexnet. a One of the inputs, b convolution layer 1 output, c convolution layer 2 output, d convolution layer 3 output

After the training, the deep neural network can predict the maneuvering operation when an encounter situation occurs. For example, Fig. 13 shows the predicted maneuver for an overtaking encounter. According to the prediction result, the NGC system can steer the AMV and realize automatic collision avoidance.

Fig. 13

Prediction of maneuver

5 Conclusions

The AMV has great application potential in military, civilian and commercial domains. To raise the level of automation of the AMV, a collision avoidance method based on the vision system is proposed. Since Alexnet has been applied successfully to image recognition, a deep convolutional neural network (Alexnet) is used here to learn the maneuvering characteristics of the crew. To obtain enough samples, European Ship Simulator is used to simulate an AMV and create encounter scenes, which are captured by screen-recording software; the collision avoidance maneuvers respect the COLREGs so that the Alexnet learns the correct behavior. Various encounter situations are captured to train and test the Alexnet. During training, the CNN extracts features automatically and uses them for pattern recognition. After training, the Alexnet has learned the maneuvering ability based on the vision system, and the final network can predict the collision avoidance operations, which indicates the validity of the proposed approach. This approach could effectively raise the degree of automation of the USV and reduce human interaction during navigation. In future work, more efficient and lightweight deep structures, better suited to embedded systems, will be considered.