1 Introduction

Ship recognition and classification algorithms for ocean areas are important for enhancing maritime safety and security [12, 17]. The goal of vessel/ship classification is to recognize the type of ship in a given image with as much detail as possible. Beyond the typical recognition challenges caused by partial occlusions and variations in scale [38], ships are especially difficult to recognize because their appearance is strongly affected by external factors such as weather conditions (e.g., cloudy, sunny), viewing geometry, and sea state [29]. Moreover, wide within-class variation in some types of vessels makes vessel classification even more complicated and challenging [2].

In recent years, numerous efforts have been devoted to the problem of vessel recognition in optical remote sensing imagery, yielding many diverse solutions [10, 11, 17]. In [15, 26], several well-studied feature extraction methods from facial recognition were applied to a small set of ship images. These include global/holistic features: Principal Component Analysis (PCA) [25], Linear Discriminant Analysis (LDA) [4], Independent Component Analysis (ICA) [3], Random Projections (Rand) [42], and Multilinear PCA (MPCA) [25]; and local features: hierarchical multiscale Local Binary Pattern (HMLBP) [14] and Histogram of Oriented Gradients (HOG) [9]. HMLBP operates on multiple gray-scale texture scales to obtain scale-invariant local spatial structure and texture information of the image. HOG is essentially an image descriptor that represents the image by the distribution of local intensity gradients or edge directions, capturing the local object appearance and shape in the image. Building on the earlier work in [15], Rainey et al. [35] explored the efficacy of several object recognition algorithms at classifying ships and other ocean vessels in commercial panchromatic satellite imagery.

In addition to the above-mentioned methods, the Bag of (visual) Words (BOVW) [19], one of the most popular approaches in image analysis and classification applications, has also been applied to ship classification. The BOVW, which is inspired by the bag-of-words representation used in text classification tasks, represents an image as a histogram of frequencies of a set of local descriptors such as the Scale-Invariant Feature Transform (SIFT) [24]. To increase the robustness of the feature, Rainey et al. [5] replaced SIFT descriptors in the BOVW model with Pyramid of Histograms of visual Words (PHOW [34]) descriptors, which are extracted on a dense grid of key points and provide greater accuracy on resized non-degraded data. An algorithm for vessel classification that combined BOW histograms with sparse representation based classification (SRC) was proposed in [36]. The original BOW and BOW with ℓ2-normalized term frequency (TF) weighting schemes were studied in [28] for the vessel type classification task in the maritime domain. In [1], Local Binary Patterns (LBP [21]), which offer high distinctiveness and low computational complexity, were employed.

According to the way features are extracted, these approaches can usually be partitioned into two categories: global (holistic) and local. Global features describe the image using a lower-dimensional space or histogram and form the feature space from the entire image; they are therefore easy to implement and computationally cheap, but their classification performance is limited. Local features use a set of local descriptors to characterize the ship image; however, when the shapes of two classes are relatively similar, local features are not discriminative enough to distinguish between them. Although the aforementioned feature extraction approaches (i.e., local features such as BOVW, and global features such as MPCA) have achieved satisfactory performance in vessel classification, each type of feature has its own advantages and limitations. Moreover, different features capture different information about ships, and a certain feature may be suitable only for one specific pattern. To this end, a global feature, which is complementary to the local features, is utilized to represent more discriminative information of a class.

Inspired by this observation, a multiple features learning (MFL) framework is proposed for ship classification in this work. The proposed method employs three types of features to simultaneously capture the global structure and local details of various ships. We employ the Gabor-based multi-scale completed local binary pattern (MS-CLBP [7]), which fuses the benefits of the Gabor wavelet and the CLBP descriptor, to extract a global feature from images; we treat it as the first type of feature in our work. This global feature captures scale-invariant and orientation-invariant spatial structure and texture information from the entire image. A local feature extraction method based on patch-based MS-CLBP and the Fisher vector (FV [37]) is used to obtain a local representation of ship images, which is considered the second type of feature. Patch-based MS-CLBP with FV has demonstrated great success in image classification [16]. The BOVW, a classical feature extraction method in the object recognition domain, is also employed in the proposed algorithm. We further combine BOVW with spatial pyramid matching (SPM [20]) as the third type of feature, which overcomes the orderless nature of the bag-of-features representation and enhances the spatial order structure of local features. The FV and BOVW are both visual coding models, but the FV employs a soft assignment strategy while the BOVW employs a hard assignment strategy. In the FV model, we use the MS-CLBP operator to extract local texture information from image patches. The BOVW model utilizes the SIFT descriptor to roughly represent edge information in an image patch. Note that both the FV and BOVW are employed in order to obtain more comprehensive local features of ship images.

In this paper, we simultaneously extract global and local features to comprehensively capture the characteristics of ships. After feature extraction, fusion strategies are applied for final classification. Fusion strategies mainly comprise feature-level and decision-level fusion, and both are investigated in the proposed classification framework. Feature-level fusion combines different feature vectors into a single feature vector. Decision-level fusion, which operates on the probability outputs of each individual classification pipeline and combines the distinct decisions into a final one, can be divided into two types: “hard” fusion at the class-label level and “soft” fusion at the probability level. Hard fusion (e.g., the majority voting (MV) rule [47]) may lead to coarse results. In this paper, we choose a “soft” fusion scheme, namely the logarithmic opinion pool (LOGP [22]), and verify its effectiveness by comparing it with the hard fusion method (i.e., MV) in the experimental analysis.

There are two main contributions of this work. First, a multiple features learning (MFL) method using three types of features, including both global and local features, is proposed; feature-level and decision-level fusion are both investigated in the proposed classification framework. Second, Gabor-based MS-CLBP is employed to extract a global feature that compensates for the local features, thereby taking full advantage of the complementary nature of global and local features. The proposed method is extensively evaluated using three public optical remote sensing ship image datasets. The experimental results verify the effectiveness of our proposed method as compared to some state-of-the-art algorithms.

The remainder of the paper is organized as follows. Section 2 describes the proposed approach, including the extraction of the three types of features and the feature-level as well as decision-level soft fusion strategies. Section 3 introduces the experimental datasets (i.e., BCCT200-resize [35], the original BCCT200 [35], and VAIS [48]). Section 4 reports the experimental results and provides some analysis. Finally, Section 5 draws several concluding remarks.

2 Proposed classification framework

The flowchart of the proposed method is illustrated in Fig. 1. There are two fusion strategies in the proposed classification framework, i.e., feature-level fusion and decision-level fusion, which are illustrated in Fig. 1a and b, respectively.

Fig. 1

Flowchart of the proposed multiple features learning framework: a feature-level fusion, b decision-level fusion

As shown in Fig. 1a, we first extract three kinds of features from the input image. The first is the Gabor-based MS-CLBP: we use multi-orientation (e.g., π/8, π/4, π/2, etc.) Gabor filters to obtain multiple Gabor feature images, on which the MS-CLBP is then employed. The second is a local feature extracted by patch-based MS-CLBP and FV: we partition the image and its multi-scale versions into dense patches and use the CLBP descriptor to extract a set of local patch descriptors, which FV encoding then maps into a discriminative representation. The third is the combination of BOVW and SPM: we apply the dense SIFT descriptor by partitioning an image into small patches, employ k-means clustering to generate the codebook that represents the visual words of the BOVW, and further employ SPM to compute the local features, which enhances the spatial order structure of local features. Finally, we employ the typical support vector machine (SVM) classifier [6] to obtain the final classification performance. Details of the feature extraction are presented in Section 2.1.

The decision-level fusion framework is shown in Fig. 1b. For each input image, the three kinds of features are extracted separately. Then, for each of the three types of features, the typical SVM classifier is employed to calculate probability estimates. Finally, the proposed method merges the outputs of the individual classification pipelines using decision-level soft fusion (i.e., LOGP) to obtain the final classification result.

2.1 Feature extraction

2.1.1 Gabor-based MS-CLBP

Inspired by the success of Gabor filters and LBP in computer vision applications, we employ the Gabor-based MS-CLBP as the first type of feature in the proposed framework to extract global features of ship images. The feature extraction process is summarized in Algorithm 1.

Algorithm 1 (Gabor-based MS-CLBP global feature extraction)

A Gabor wavelet [8, 30] transform can be viewed as an orientation-dependent bandpass filter. Its impulse response is defined by a sinusoidal wave multiplied by a Gaussian function. In the 2-D (x, y) coordinate system, a Gabor filter, which comprises a real component and an imaginary component, can be defined as,

$$ {G}_{\varepsilon ,\theta ,\psi ,\sigma ,\gamma }(x,y)=\exp \left( -\frac{{x^{\prime}}^{2}+{{\gamma }^{2}}{y^{\prime}}^{2}}{2{{\sigma }^{2}}} \right)\exp \left( i\left( 2\pi \frac{x^{\prime}}{\varepsilon }+\psi \right) \right), $$
(1)

where \( x^{\prime }=x\cos \theta +y\sin \theta \) and \( y^{\prime }=-x\sin \theta +y\cos \theta \). Here, ε is the wavelength of the sinusoidal factor, 𝜃 is the orientation separation angle (e.g., π/8, π/4, π/2, etc.) of the Gabor kernels, ψ represents the phase offset, σ is the standard deviation of the Gaussian envelope, and γ is the spatial aspect ratio specifying the ellipticity of the support of the Gabor function. Note that ψ = 0 and ψ = π/2 return the real and imaginary parts of the Gabor filter, respectively. The parameter σ is determined by ε and the spatial frequency bandwidth bw as,

$$ \sigma =\frac{\varepsilon }{\pi }\sqrt{\frac{\ln 2}{2}}\frac{{{2}^{bw}}+1}{{{2}^{bw}}-1} . $$
(2)
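
To make the construction concrete, the following minimal NumPy sketch builds a Gabor kernel directly from (1) and (2); the kernel size and the example wavelength are illustrative assumptions rather than values prescribed by the paper.

```python
import numpy as np

def gabor_kernel(eps, theta, psi=0.0, gamma=0.5, bw=5, size=31):
    """Gabor kernel per Eqs. (1)-(2): eps is the wavelength, theta the
    orientation, psi the phase offset, gamma the spatial aspect ratio,
    and bw the spatial frequency bandwidth that determines sigma."""
    sigma = (eps / np.pi) * np.sqrt(np.log(2) / 2) * (2**bw + 1) / (2**bw - 1)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_p = x * np.cos(theta) + y * np.sin(theta)    # x'
    y_p = -x * np.sin(theta) + y * np.cos(theta)   # y'
    envelope = np.exp(-(x_p**2 + (gamma * y_p)**2) / (2 * sigma**2))
    carrier = np.exp(1j * (2 * np.pi * x_p / eps + psi))
    return envelope * carrier  # real part: psi = 0; imaginary part: psi = pi/2

# eight orientations, as used in Section 4.1; eps = 8 is an illustrative choice
kernels = [gabor_kernel(eps=8.0, theta=k * np.pi / 8) for k in range(8)]
```

Convolving the input image with the real part of each kernel yields the multi-orientation Gabor feature images on which the MS-CLBP then operates.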

The LBP [27] is a texture operator that characterizes the spatial structure of local image texture, and it has been widely employed in object recognition (e.g., texture classification, face recognition, etc.). Consider a pixel in the image whose gray value is denoted as g_c, with neighboring pixels equally spaced on a circle of radius r. The LBP code of g_c, in decimal form, is computed by comparing it with its neighbors,

$$\begin{array}{@{}rcl@{}} LBP_{m,r}\left( g_{c} \right)&=&\sum\limits_{p=0}^{m-1}{s\left( g_{p}-g_{c} \right)2^{p}}=\sum\limits_{p=0}^{m-1}{s\left( d_{p} \right)2^{p}}, \\ s(d_{p})&=&\left\{ \begin{array}{cc} 1, & d_{p}\ge 0 \\ 0, & d_{p}<0 \end{array} \right. \end{array} $$
(3)

where g_p is the gray value of the p-th neighbor, d_p = (g_p − g_c) represents the difference between each neighbor and the center pixel, r is the radius of the circle, and m is the total number of involved neighbors. If the coordinate of g_c is (0, 0), the coordinates of g_p are \((-r\sin (2\pi {p}/{m}),r\cos (2\pi {p}/{m}))\). After the LBP-coded image is generated by computing the LBP values of all pixels in the image, a histogram is calculated to represent the texture image. Nevertheless, the LBP only uses the sign information of d_p while ignoring the magnitude information.

In the improved CLBP [13], the sign and magnitude components, i.e., CLBP-Sign (CLBP_S) and CLBP-Magnitude (CLBP_M), are complementary. CLBP_S is equivalent to the traditional LBP operator, and the CLBP_M operator is expressed as,

$$\begin{array}{@{}rcl@{}} CLBP\_M_{m,r}\left( g_{c} \right)&=&\sum\limits_{p=0}^{m-1}{f\left( \left| g_{p}-g_{c} \right|,\eta \right)2^{p}}=\sum\limits_{p=0}^{m-1}{f\left( \left| d_{p} \right|,\eta \right)2^{p}}, \\ f\left(\left| d_{p} \right|,\eta \right)&=&\left\{ \begin{array}{cc} 1, & \left| d_{p} \right|\ge \eta \\ 0, & \left| d_{p} \right|<\eta \end{array} \right. \end{array} $$
(4)

where η is a threshold that is usually set to the mean value of \(\left | {{d}_{p}} \right |\) over the whole image. The combined CLBP describes both the spatial structure and the local contrast information by mapping CLBP_S and CLBP_M into histograms. To make the feature scale invariant, a multi-scale representation of CLBP, termed MS-CLBP, is considered. In our work, the MS-CLBP representation is formed by concatenating the CLBP_S and CLBP_M histogram features extracted at different scales, which are obtained by down-sampling the original image using bicubic interpolation [7]. An example of the implementation of a 3-scale CLBP operator is illustrated in Fig. 2.
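
As a rough illustration of the operator, the sketch below computes CLBP_S and CLBP_M histograms (Eqs. (3)-(4)) and concatenates them across scales. It is a simplified sketch under stated assumptions: neighbors are sampled at the nearest pixel rather than with bilinear interpolation, full 2^m-bin histograms are used rather than the compact mapping that yields the paper's 216-dimensional feature, and cubic-spline resampling stands in for bicubic down-sampling.

```python
import numpy as np
from scipy.ndimage import zoom

def clbp_histograms(img, m=10, r=1):
    """CLBP_S (Eq. 3, identical to LBP) and CLBP_M (Eq. 4) histograms,
    computed densely over the interior pixels of a gray-scale image."""
    h, w = img.shape
    angles = 2 * np.pi * np.arange(m) / m
    dy = np.rint(-r * np.sin(angles)).astype(int)  # neighbor offsets on the circle
    dx = np.rint(r * np.cos(angles)).astype(int)
    center = img[r:h - r, r:w - r].astype(float)
    s_codes = np.zeros(center.shape, dtype=np.int64)
    mags = []
    for p in range(m):
        neigh = img[r + dy[p]:h - r + dy[p], r + dx[p]:w - r + dx[p]].astype(float)
        mags.append(np.abs(neigh - center))                       # |d_p|
        s_codes += ((neigh - center) >= 0).astype(np.int64) << p  # sign code
    mags = np.stack(mags)
    eta = mags.mean()                                 # threshold: mean of |d_p|
    m_codes = ((mags >= eta).astype(np.int64)
               << np.arange(m)[:, None, None]).sum(axis=0)        # magnitude code
    hist_s = np.bincount(s_codes.ravel(), minlength=2**m)
    hist_m = np.bincount(m_codes.ravel(), minlength=2**m)
    return np.concatenate([hist_s, hist_m]).astype(float)

def ms_clbp(img, scales=(1, 1/2, 1/3), m=10, r=1):
    """Multi-scale CLBP: concatenate histograms of down-sampled versions."""
    return np.concatenate([clbp_histograms(zoom(img, s, order=3), m, r)
                           for s in scales])
```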

Fig. 2

An example of a 3-scale CLBP operator (m = 10, r = 1): a scale = 1, b scale = 1/2, c scale = 1/3

2.1.2 Patch-based MS-CLBP and FV

As noted in [16], an image representation based on patch-based MS-CLBP and FV extracts local image features well and has achieved strong performance in remote sensing image scene classification. In this work, we first use a regular grid to partition an entire image into B × B overlapped patches. For simplicity, the overlap between two patches is half of the patch size (i.e., B/2) in both the horizontal and vertical directions. For each patch, we employ the second implementation of MS-CLBP to capture the spatial pattern and the contrast of the local image texture. If M patches are extracted from the multi-scale images, we obtain a set of local patch descriptors that form a feature matrix denoted by H = [h_1, h_2, ..., h_M], where h_i is the CLBP histogram feature vector extracted from patch i, as sketched below.
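
A minimal sketch of this dense patch extraction, assuming the clbp_histograms helper from the previous sketch, might look as follows; the patch size and the (m, r) values follow Section 4.1.

```python
import numpy as np

def dense_patches(img, B=28):
    """Partition an image into B x B patches with B/2 overlap in both
    the horizontal and vertical directions, as described above."""
    step = B // 2
    H, W = img.shape
    return [img[i:i + B, j:j + B]
            for i in range(0, H - B + 1, step)
            for j in range(0, W - B + 1, step)]

# feature matrix H = [h_1, ..., h_M]: one CLBP histogram per patch
# H = np.stack([clbp_histograms(p, m=6, r=3) for p in dense_patches(img)])
```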

After local feature extraction, FV encoding is used to encode the local patch descriptors into a discriminative representation. Let X = {x_i, i = 1, ⋯, N} be the set of local patch descriptors extracted from an image. A Gaussian mixture model (GMM), whose probability density function is denoted as u_λ with parameter λ, is trained on the training images using Maximum Likelihood (ML) estimation [23]. The image can then be characterized by the gradient vector,

$$ \mathbf{G}_{\mathbf{\lambda} }^{\mathbf{X}}=\frac{1}{N}{{\nabla }_{\mathbf{\lambda} }}\log {{u}_{\mathbf{\lambda} }}(\mathbf{X}) . $$
(5)

The gradient of the log-likelihood describes the direction along which the parameters should be adjusted to best fit the data. The GMM u_λ can be represented as

$$ {{u}_{\mathbf{\lambda} }}(\mathbf{x})=\sum\limits_{t=1}^{T}{{{\omega }_{t}}{{u}_{t}}(\mathbf{x})} . $$
(6)

We denote λ = {ω_t, μ_t, Σ_t, t = 1, ..., T}, where ω_t, μ_t and Σ_t are the mixture weight, mean vector, and covariance matrix of Gaussian u_t, and T is the number of Gaussians in the GMM. Under an independence assumption, the covariance matrices are diagonal, i.e., \({{\Sigma }_{t}}=\text {diag}({\sigma _{t}^{2}})\). We make use of the Fisher kernel (FK) of [18] to measure the similarity between two samples x and y; it is defined as,

$$ \mathbf{S}(\mathbf{x},\mathbf{y})=\mathbf{G}{{_{\mathbf{\lambda} }^{\mathbf{x}^{\prime}}}}\mathbf{F}_{\mathbf{\lambda} }^{-1}\mathbf{G}_{\mathbf{\lambda} }^{\mathbf{y}}, $$
(7)

where F_λ is the Fisher information matrix of u_λ: \({{\mathbf {F}}_{\mathbf {\lambda } }}={{\mathbf {E}}}[\nabla _{\mathbf {\lambda } }^{{}}\log {{u}_{\mathbf {\lambda } }}(\mathbf {x}){{\nabla }_{\mathbf {\lambda } }}\log {{u}_{\mathbf {\lambda } }}{{(\mathbf {x})}^{\prime }}]\). Following the diagonal closed-form approximation of [31], the FK can be rewritten as a dot product of normalized vectors \({{\mathbb {G}}_{\mathbf {\lambda } }}\), with

$$ \mathbb{G}_{\mathbf{\lambda} }^{\mathbf{x}}=\mathbf{F}_{\mathbf{\lambda} }^{-1/2}\mathbf{G}_{\mathbf{\lambda} }^{\mathbf{x}} . $$
(8)

Let γ_i(t) be the occupancy probability, i.e., the probability that descriptor x_i is generated by the Gaussian u_t,

$$ {{\gamma }_{i}}(t)=\frac{{{\omega }_{t}}{{u}_{t}}({{\mathbf{x}}_{i}})}{\sum\limits_{j=1}^{T}{{{\omega }_{j}}{{u}_{j}}({{\mathbf{x}}_{i}})}}. $$
(9)

Let \(\mathbb {G}_{\mu ,t}^{\mathbf {X}}\) (resp. \(\mathbb {G}_{\sigma ,t}^{\mathbf {X}}\)) be the gradient with respect to the mean μ_t (resp. standard deviation σ_t) of Gaussian t. Mathematical derivation leads to,

$$ \mathbb{G}_{\mu ,t}^{\mathbf{X}}=\frac{1}{N\sqrt{{{\omega }_{t}}}}\sum\limits_{i=1}^{N}{{{\gamma }_{i}}(t)\left( \frac{{{\mathbf{x}}_{i}}-{{\mu }_{t}}}{{{\sigma }_{t}}}\right)}, $$
(10)

and

$$ \mathbb{G}_{\sigma ,t}^{\mathbf{X}}=\frac{1}{N\sqrt{2{{\omega }_{t}}}}\sum\limits_{i=1}^{N}{{{\gamma }_{i}}(t)\left[\frac{{{({{\mathbf{x}}_{i}}-{{\mu }_{t}})}^{2}}}{{\sigma }_{t}^{2}}-1\right]} . $$
(11)

The final gradient vector is the concatenation of the \(\mathbb {G}_{\mu ,t}^{\mathbf {X}}\) and \(\mathbb {G}_{\sigma ,t}^{\mathbf {X}}\) vectors for t = 1, ..., T; since each is D-dimensional, the dimensionality of the FV is 2D × T, where D denotes the dimensionality of the local descriptors.

Each training image yields a feature matrix of local patch descriptors produced by the patch-based MS-CLBP operator. All the feature matrices of the training data are then used to estimate the GMM parameters via Maximum Likelihood (ML) estimation. After that, the FV is utilized to generate the final feature representation. The feature extraction process is summarized in Algorithm 2, and Fig. 3 illustrates the detailed procedure for generating the FV of training images. For a testing image, we use the patch-based MS-CLBP feature extraction method shown in Fig. 3 to obtain local descriptors; the Fisher kernel representation, based on the GMM estimated from the training data, is then used to obtain the final FV feature. The detailed process for generating the FV of a testing image is shown in Fig. 4.
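
A compact scikit-learn sketch of the FV encoding defined by Eqs. (9)-(11) is given below; the power and L2 normalization steps at the end are common FV post-processing choices rather than details stated in this paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# GMM trained on local patch descriptors pooled from all training images:
# gmm = GaussianMixture(n_components=20, covariance_type='diag').fit(train_desc)

def fisher_vector(X, gmm):
    """Fisher vector of descriptor set X (N x D) w.r.t. a diagonal GMM,
    concatenating the mean and std-dev gradients of Eqs. (10)-(11)."""
    N, _ = X.shape
    gamma = gmm.predict_proba(X)          # occupancy probabilities (Eq. 9), N x T
    w, mu = gmm.weights_, gmm.means_
    sigma = np.sqrt(gmm.covariances_)     # T x D standard deviations
    G_mu, G_sigma = [], []
    for t in range(gmm.n_components):
        diff = (X - mu[t]) / sigma[t]
        G_mu.append(gamma[:, t] @ diff / (N * np.sqrt(w[t])))                 # Eq. 10
        G_sigma.append(gamma[:, t] @ (diff**2 - 1) / (N * np.sqrt(2 * w[t])))  # Eq. 11
    fv = np.concatenate(G_mu + G_sigma)   # dimensionality 2 * D * T
    fv = np.sign(fv) * np.sqrt(np.abs(fv))          # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)        # L2 normalization
```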

Algorithm 2 (patch-based MS-CLBP and FV local feature extraction)
Fig. 3

Fisher vector representation of the training images

Fig. 4

Fisher vector representation of a testing image

2.1.3 BOVW and SPM

In recent years, the BOVW model, a classical local feature representation method, has demonstrated outstanding performance in several object recognition tasks such as face recognition. Since the traditional BOVW model ignores spatial and structural information, we attach SPM to the BOVW framework, as SPM can capture the spatial arrangement of an image. The BOVW feature extraction process is summarized in Algorithm 3, and the overall description of the combined BOVW and SPM feature is summarized in Algorithm 4.

First of all, we partition the image into small blocks using a regular grid and extract a local descriptor with the SIFT operator on each block, whose center is viewed as a key point. An entire image is partitioned into ϖ × ϖ overlapped blocks, where the overlap between two blocks is half of the block size (i.e., ϖ/2) in both the horizontal and vertical directions.

In doing so, we obtain a set of local descriptors. Then, k-means clustering is employed to generate the codebook that represents the visual words of the BOVW. After that, we encode the patches against the codebook using vector quantization and calculate the frequency histogram. The SPM is further employed to enhance the spatial order structure of the local features. In the SPM framework, an image is partitioned into 2^l × 2^l segments at different scales (e.g., l = 0, 1, 2), and a BOVW histogram is calculated within each segment. The final feature representation of the image is the concatenation of all the histograms. An example of the implementation of the SPM is illustrated in Fig. 5, and a code sketch follows below.
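
The following sketch outlines the encoding stage under stated assumptions: the dense SIFT descriptors and their key-point positions are assumed to come from an external extractor (e.g., OpenCV or VLFeat), and the per-level weighting of the original SPM kernel is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

# codebook from randomly sampled training blocks (size 1024, per Section 4.1):
# kmeans = KMeans(n_clusters=1024).fit(sampled_training_descriptors)

def spm_encode(descriptors, positions, kmeans, img_shape, levels=(0, 1, 2)):
    """BOVW + SPM: vector-quantize local descriptors against the k-means
    codebook, then concatenate per-cell histograms over a 2^l x 2^l grid
    for each pyramid level l."""
    words = kmeans.predict(descriptors)   # hard assignment to visual words
    K = kmeans.n_clusters
    H, W = img_shape
    feats = []
    for l in levels:
        cells = 2 ** l
        for cy in range(cells):
            for cx in range(cells):
                in_cell = ((positions[:, 0] * cells // H == cy) &
                           (positions[:, 1] * cells // W == cx))
                hist = np.bincount(words[in_cell], minlength=K).astype(float)
                feats.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(feats)
```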

Algorithm 3 (BOVW feature extraction)
Algorithm 4 (combined BOVW and SPM feature representation)
Fig. 5

Example of a three-level spatial pyramid matching (l = 0, 1, 2)

2.2 Feature-level fusion

After feature extraction, feature-level fusion and decision-level fusion are both investigated for the final classification. Figure 1a illustrates the feature-level fusion [43] employed in this work. For multiple feature learning, the Gabor-based MS-CLBP extracts a spatial structure and texture feature; the local feature extraction method based on patch-based MS-CLBP and FV is a visual coding model with a soft assignment strategy; and the other local feature extraction method, combining BOVW and SPM, is a visual coding model with a hard assignment strategy. For various classification tasks, these features have their own advantages and disadvantages, and it is difficult to determine which one is always optimal [21]. Thus, we use a feature-level fusion strategy that fuses the three types of features by straightforwardly stacking the feature vectors into a composite one. The specific implementation of the feature fusion process is presented in Algorithm 5. To align the scales of the feature values, feature normalization before stacking is a necessary preprocessing step.

Algorithm 5 (feature-level fusion)
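
A minimal sketch of this stacking scheme is shown below; L2 normalization is one reasonable choice for the normalization step, since the paper does not specify the exact scheme.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def fuse_features(global_feat, fv_feat, bovw_feat):
    """Feature-level fusion: normalize each feature vector to a comparable
    scale, then stack the three into one composite vector (Algorithm 5)."""
    parts = [normalize(f.reshape(1, -1)).ravel()   # L2-normalize each feature
             for f in (global_feat, fv_feat, bovw_feat)]
    return np.concatenate(parts)

# X_train = np.stack([fuse_features(g, f, b) for g, f, b in train_triples])
# clf = SVC(kernel='linear', probability=True).fit(X_train, y_train)
```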

2.3 Decision-level fusion

Decision-level fusion [32] merges the results of each individual classification pipeline and combines the distinct classification results into a final decision, which can improve on the accuracy of a single classifier using one type of features. Score-level fusion [39, 45, 46] is a special case of decision-level fusion and is equivalent to its soft variant. The goal here is to use score-level fusion to combine the posterior-probability estimates provided by each individual classification pipeline. In this paper, we employ the LOGP soft fusion rule, also known as the product rule [44], in the score-level fusion scheme. The process is illustrated in Fig. 1b.

Let \({{p}_{i}}({{y}_{k}}\left | \mathbf {x} \right .)\) be the conditional class probability of the i-th classifier for class k (0 < k ≤ C), where C is the number of classes and y_k indicates the k-th class to which a sample x may belong. The LOGP rule [33] uses the individual conditional class probabilities of each classifier to estimate a global membership function \(\mathbf {P}({{y}_{k}}\left | \mathbf {x} \right .)\), which is a weighted product of these output probabilities. The final class label is given by,

$$ {{y}^{*}}=\arg \underset{k=1, ...,C}{\max}\, \mathbf{P}({{y}_{k}}\left| \mathbf{x} \right.), $$
(12)

where the global membership function is defined as,

$$ \mathbf{P}({{y}_{k}}\left| \mathbf{x} \right.) = \underset{i=1}{\overset{3}{\prod}}\,{{p}_{i}}{{({{y}_{k}}\left| \mathbf{x} \right.)}^{{{\alpha}_{i}}}}, $$
(13)

or

$$ \log \mathbf{P}({{y}_{k}}\left| \mathbf{x} \right.) = \sum\limits_{i=1}^{3}{{{\alpha}_{i}}\log {{p}_{i}}({{y}_{k}}\left| \mathbf{x} \right.)}, $$
(14)

where \(\{{{\alpha }_{i}}\}_{i=1}^{3}\) are the classifier weights, here distributed uniformly over all of the classifiers. The overall decision-level fusion procedure is summarized in Algorithm 6 and sketched in code below.

Algorithm 6 (decision-level fusion)
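
The fusion rule itself reduces to a few lines; the sketch below implements Eqs. (12)-(14) with uniform weights, where the three probability matrices are assumed to come from the three per-feature SVM pipelines.

```python
import numpy as np

def logp_fusion(probas, alphas=None):
    """LOGP soft fusion (Eqs. 12-14): probas is a list of N x C posterior
    matrices, one per classifier; alphas are the classifier weights."""
    n = len(probas)
    alphas = np.full(n, 1.0 / n) if alphas is None else np.asarray(alphas)
    eps = 1e-12                                   # guard against log(0)
    log_P = sum(a * np.log(p + eps) for a, p in zip(alphas, probas))
    return np.argmax(log_P, axis=1)               # y* = argmax_k P(y_k | x)

# probas = [clf.predict_proba(X_test_i) for clf, X_test_i in zip(clfs, feats)]
# y_pred = logp_fusion(probas)
```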

3 Experimental datasets

To evaluate the efficacy of the proposed MFL approach for ship classification, we conduct extensive experiments on optical image datasets. In the experiments, the LIBSVM library is employed for classification, since it provides a posterior-probability estimate for each type of features.

The first dataset is an overhead satellite scene dataset referred to as BCCT200-resize, detailed in [35]. It consists of small grey-scale ship images chipped out of larger electro-optical satellite images by the RAPIER Ship Detection System. The chips have been pre-processed so that they are rotated and aligned to uniform dimensions and orientation. The dataset contains 4 ship categories (i.e., barges, cargo ships, container ships, and tankers), each with 200 images of size 150 × 300 pixels. Examples of each class are shown in Fig. 6.

Fig. 6

Example images from each class of the BCCT200-resize data: a barge, b cargo ship, c container ship, and d tanker

The second dataset is the original BCCT200 dataset, in which the images are unprocessed and display ships under various orientations and resolutions; such variations make the dataset more challenging. The dataset includes the same classes: barges, container ships, cargo ships, and tankers, with 200 images of non-uniform size collected per class. Example images of each class are shown in Fig. 7.

Fig. 7

Example images from each class of the original BCCT200 data

To facilitate a fair comparison, we follow the same experimental setup as reported in [35] for the above two datasets. Five-fold cross-validation is performed, in which the dataset is randomly partitioned into five equal and disjoint subsets, each containing 40 images per ship class. For each run, a different subset is held out for testing and the remaining four are used for training. The classification accuracy is averaged over the five cross-validation evaluations, as sketched below.
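
For reference, this protocol corresponds to the following scikit-learn sketch, where X and y stand for the fused feature matrix and the class labels; the linear kernel is an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accs = []
for train_idx, test_idx in skf.split(X, y):      # X: features, y: labels
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))  # fold accuracy
print(f"mean accuracy over 5 folds: {np.mean(accs):.4f}")
```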

The third dataset used in our experiments, referred to as VAIS [48], is the world's first publicly available dataset of paired visible and infrared ship imagery. The dataset consists of 2865 images (1623 visible and 1242 IR), of which 1088 are corresponding pairs. It includes 6 coarse-grained categories: merchant ships, sailing ships, medium passenger ships, medium "other" ships, tugboats, and small boats. The area of the visible bounding boxes ranges from 644 to 6,350,890 pixels, with a mean of 181,319 pixels and a median of 13,064 pixels. Example bounding-box images are shown in Fig. 8.

Fig. 8

Five visible image samples from each of the main categories of the VAIS data

The dataset is partitioned into "official" train and test splits: all images of each named ship were greedily assigned to one partition or the other, resulting in 539 image pairs and 334 singletons for training, and 549 image pairs and 358 singletons for testing. In our experiments, we only use the visible ship imagery. Following the same image pre-processing as the deep convolutional neural network (CNN) algorithm in [48], we resize each crop to a uniform size using bicubic interpolation. Note that the original visible images in the VAIS dataset are color images; they are converted from the RGB color space to the YCbCr color space, and the Y component (luminance) is used for ship classification.
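
Extracting the Y component amounts to the standard ITU-R BT.601 luma transform, as in this small sketch:

```python
import numpy as np

def luminance(rgb):
    """Y component of YCbCr (ITU-R BT.601 luma) from an H x W x 3 RGB image."""
    rgb = rgb.astype(float)
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
```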

4 Experimental results and analysis

4.1 Parameters setting

First of all, we investigate the optimal parameters for each type of features. For the first one (i.e., the Gabor-based MS-CLBP), following [30], the orientations 𝜃 of the Gabor filters are set to \(\left [ 0, \frac {\pi }{8}, \frac {\pi }{4}, \frac {3\pi }{8}, \frac {\pi }{2}, \frac {5\pi }{8}, \frac {3\pi }{4}, \frac {7\pi }{8} \right ]\) and the spatial frequency bandwidth is set to 5 for all experimental datasets. The result of Gabor filtering of a sample image is shown in Fig. 9. We then estimate the optimal parameter pair (m, r) for the CLBP operator: in this experiment, we vary the CLBP parameters while fixing all others. The classification results are listed in Tables 1, 2 and 3 for the three experimental datasets, respectively.

Fig. 9

Example of Gabor filtering: a input image, b–i the 8 Gabor filtered images corresponding to 8 different orientations

Table 1 Classification accuracy (%) of the first feature with different parameters (m, r) for the BCCT200-resize dataset
Table 2 Classification accuracy (%) of the first feature with different parameters (m, r) for the original BCCT200 dataset

The number of neighbors (i.e., m) and the radius (i.e., r) directly impact the dimensionality of the CLBP histogram features; for example, a larger m increases the feature dimensionality and the computational complexity. Based on the results in Tables 1, 2 and 3, we choose \(\left (m, r \right )=\left (10, 6 \right )\) for the BCCT200-resize dataset, \(\left (m, r \right )=\left (12, 7 \right )\) for the original BCCT200 dataset, and \(\left (m, r \right )=\left (10, 5 \right )\) for the VAIS dataset, in terms of both classification accuracy and computational complexity. With these settings, the dimensionality of the CLBP features (CLBP_S and CLBP_M histograms concatenated) for both the BCCT200-resize dataset and the VAIS dataset is 216.

Table 3 Classification accuracy (%) of the first feature with different parameters (m, r) for the VAIS dataset

For the second type of features (i.e., patch-based MS-CLBP and FV), since this feature is extracted from dense local patches, the patch size (B × B) is studied. Varying the patch size B results in a varying number of local patches, so the number of Gaussians in the GMM (i.e., K) is studied simultaneously. In the experiment, B is chosen from a reasonable range, and Fig. 10 illustrates the classification results for varying patch sizes B and different numbers of Gaussians K for the three experimental datasets. The optimal parameters B and K are determined from the tuning results in Fig. 10: the optimal patch size B is 28 and the number of Gaussians K is 20 for the BCCT200-resize dataset; for the original BCCT200 dataset, the optimal patch size B is 22 and the number of Gaussians K is 25; for the VAIS dataset, the optimal patch size B and the number of Gaussians K are both set to 20. For the MS-CLBP parameters, the scale is set to \(1/\left [ 1:3 \right ]\) and the parameter pair is empirically chosen as (m, r) = (6, 3).

Fig. 10

Parameter tuning of B and K for the second feature using the three experimental datasets: a the BCCT200-resize, b the original BCCT200, c the VAIS

For the third type of features (i.e., the BOVW and SPM), the block size (i.e., ϖ × ϖ) is studied for the dense SIFT descriptors. The experimental results with varying block sizes for the three experimental datasets are illustrated in Fig. 11. The optimal block size ϖ is 24 for both the BCCT200-resize dataset and the original BCCT200 dataset, while ϖ = 26 achieves the best classification performance for the VAIS dataset. For k-means clustering, blocks are randomly selected and the codebook size is set to 1024.

Fig. 11

Classification accuracy (%) versus varying block sizes for the three experimental datasets

4.2 Classification results and analysis

The proposed MFL strategy is compared with some state-of-the-art methods under the same experimental setup to verify its effectiveness. For the BCCT200-resize and original BCCT200 datasets, five-fold cross-validation is performed: the dataset is randomly partitioned into five equal and disjoint subsets; for each run a different subset is held out for testing and the remaining four (i.e., 80% of the images from each class) are used for training, and the classification accuracy is averaged over the five evaluations. For the VAIS dataset, the "official" train and test splits are used. Specifically, the proposed method is compared with the feature fusion method that combines the Gabor feature and the MS-CLBP feature; the improvement of the proposed method over Gabor + MS-CLBP and the other existing methods verifies that our multiple feature learning method is more effective. The comparison results for the BCCT200-resize, original BCCT200 and VAIS datasets are shown in Tables 4, 5 and 6, respectively. The proposed method achieves superior classification performance over the other existing methods, which demonstrates the effectiveness of the proposed MFL for vessel classification. In particular, the MFL gains about 10% higher overall accuracy than the deep learning method on the BCCT200-resize dataset, and about 4% higher overall accuracy than the Convolutional Neural Network (CNN) on the VAIS dataset. Therefore, the proposed approach, which combines multiple hand-crafted features, has clear advantages. Moreover, besides the typical SVM, we also employ another classifier (i.e., the extreme learning machine, ELM [21]) under the MFL framework. Tables 4, 5 and 6 show that the feature-level strategy is preferable for the VAIS dataset while the decision-level strategy is preferable for the BCCT200-resize and BCCT200 datasets. For different ship image datasets, the decision-level and feature-level strategies have their own advantages and disadvantages, and it is hard to determine which one is always optimal.

Table 4 Comparison of classification accuracy (%) with some state-of-the-art methods for the BCCT200-resize dataset
Table 5 Comparison of classification accuracy (%) with some state-of-the-art methods for the original BCCT200 dataset
Table 6 Comparison of classification accuracy (%) with some state-of-the-art methods for the VAIS dataset

In addition, to demonstrate the enhanced discriminative power of the multiple features fusion strategy, we compare the classification performance of the proposed multiple feature learning approach with that of methods using each individual feature in this framework with the LIBSVM classifier. The per-class accuracy of the aforementioned methods is reported in Tables 7, 8 and 9. It is apparent that the proposed MFL outperforms all the individual feature based approaches. For the BCCT200-resize dataset, the global feature representation method, i.e., Gabor + MS-CLBP, achieves the highest accuracy for the cargo category. The first local feature representation method, i.e., MS-CLBP + FV, achieves the maximum accuracy for the container category in the original BCCT200 dataset. For the VAIS dataset, the second local feature representation method, i.e., BOVW + SPM, achieves the maximum accuracy for the medium passenger category. Nevertheless, the proposed MFL obtains superior accuracy for the other classes and also the highest overall accuracy on all three experimental datasets.

Table 7 Classification accuracy (%) per class for the BCCT200-resize dataset
Table 8 Classification accuracy (%) per class for the original BCCT200 dataset
Table 9 Classification accuracy (%) per class for the VAIS dataset

The confusion matrix of the proposed MFL with the feature-level fusion strategy for the BCCT200-resize dataset is listed in Table 10, and that with the decision-level fusion strategy is listed in Table 11; both are averaged over the five cross-validation evaluations. The major confusion occurs between class 2 (i.e., cargo) and class 3 (i.e., container), since some cargo images are similar to container images. Tables 12 and 13 show the confusion matrices of feature-level fusion and decision-level fusion for the original BCCT200 dataset, respectively; here the major confusion occurs among class 3 (i.e., container), class 4 (i.e., tanker), and class 2 (i.e., cargo). Tables 14 and 15 display the confusion matrices of feature-level fusion and decision-level fusion for the VAIS dataset, respectively. The major confusion occurs among class 2 (i.e., sailing), class 1 (i.e., merchant), class 3 (i.e., medium passenger), and class 5 (i.e., tug), or between class 6 (i.e., small) and class 3 (i.e., medium passenger). The sailing category contains small ships with sails up, small ships with sails down, and large ships with sails down, and some ships with sails down resemble merchant ships. The small-boat category includes speedboats, jet-skis, and smaller and larger pleasure boats, so some small boats and medium passenger ships are highly similar.

Table 10 Confusion matrix of the proposed MFL (feature-level fusion) for the BCCT200-resize dataset
Table 11 Confusion matrix of the proposed MFL (decision-level fusion) for the BCCT200-resize dataset
Table 12 Confusion matrix of the proposed MFL (feature-level fusion) for the original BCCT200 dataset
Table 13 Confusion matrix of the proposed MFL (decision-level fusion) for the original BCCT200 dataset
Table 14 Confusion matrix of the proposed MFL (feature-level fusion) for the VAIS dataset
Table 15 Confusion matrix of the proposed MFL (decision-level fusion) for the VAIS dataset

To evaluate the effectiveness of the decision-level fusion strategy (i.e., LOGP) of the proposed framework, we compare it with the popular hard fusion rule of majority voting [21]. Table 16 compares the classification results of the two fusion schemes (i.e., LOGP and MV) on the experimental datasets. As is evident from the results, the soft fusion strategy provides slightly better performance than the hard MV rule, because fusion at the posterior-probability level provides more flexibility than hard vote counting, especially in multi-class classification tasks.

Table 16 Classification accuracy (%) of the proposed MFL using different fusion rules

5 Conclusion

In this paper, we proposed a multiple features learning (MFL) classification framework for ship classification in optical remote sensing imagery. The framework combines multiple features: the Gabor-based MS-CLBP, patch-based MS-CLBP with the Fisher vector, and the combination of BOVW and SPM. These global and local features are complementary, and their combination provides a powerful and comprehensive representation of ship images. The proposed method proved more discriminative than any of the individual feature based approaches. Experimental results on the optical vessel datasets verified that the proposed decision-level soft fusion classification method consistently achieves superior classification performance over other state-of-the-art algorithms.

Given the recent tremendous success of deep learning techniques, especially CNNs, in image classification, a combination of deep learning features and hand-crafted features for ship classification will be investigated in our future work.