1 Introduction

Identifying and observing fish species underwater, in freshwater and in ponds, is in great demand among tourists. It is also important for researchers, scientists and marine biologists who monitor the behavior of various fish species. Commercial applications such as fish farming depend heavily on observing fish species and their habitats, both to breed similar fish and to study their life cycles. Climate and environmental changes have a great impact on fish species and their habitats. Manual methods are usually time-consuming and require a lot of effort to acquire samples in different environments.

The identification and recognition of fish species is a challenging research topic. The challenges include distortion, noise, segmentation errors and occlusion. Earlier work was restricted to controlled environments [1], and most recognition research has focused on objects on land. Nevertheless, the demand for aquatic species identification and recognition has increased rapidly.

In recent years, a large number of machine learning algorithms have been designed and implemented for underwater species classification [2]. These algorithms mainly classify dead fish samples based on shape and texture information [3, 4]. Storbeck et al. [5] used laser light to measure features such as the length, width and thickness of numerous species for 3D modeling of fish. Unconstrained classification of fish is more challenging in environments with variation in luminosity, background confusion among reefs and aquatic plants, and water turbidity. The similar shape, texture and color of various fish species pose a further challenge for accurate classification. Spampinato et al. [3] and Roya et al. [6] presented two different classical methods to classify fish species based on shape and textural patterns in natural, unconstrained environments. Shafait et al. [7] used recorded videos to identify fish species in uncontrolled environments based on their abundance and biomass content. Hernández-Serna and Jiménez-Segura [8] used an ANN for automatic identification of fish species based on morphology, texture and geometry. Huang et al. [9] presented a hierarchical classification method to recognize live fish in the open sea. Sun et al. [10] proposed a method for recognizing fish species from low-resolution images.

Nagashima et al. [4] used two features, speckle patterns and fish scales, and proposed a method based on morphological algorithms and filters. Hsiao et al. [11] applied sparse representation classification (SRC) combined with PCA to classify 25 different fish species with an accuracy of 81.8%. Huang et al. [12] employed a Gaussian mixture model and support vector machines (SVM) and achieved a recognition rate of 74.8% on a dataset of 15 different fish species containing around 24,000 images. Artificial neural networks (ANNs) were introduced in the twentieth century but became unpopular because they required high levels of supervised training and were unable to solve extremely difficult and complex problems.

Convolution is a well-known operation in computer vision and signal processing. Computer vision experts commonly use it for noise reduction and edge detection [13]. Convolutional neural networks (CNNs) are a special type of ANN that has gained importance in a wide range of applications in artificial intelligence and machine learning. CNNs and their variants achieved promising results for handwritten digit classification, object recognition and facial recognition. However, they gradually lost their importance because of hardware constraints, high memory consumption and the need for large amounts of data [14]. With technological advancement, it has become easier to train deep and complex networks, and processing power has increased with the development of powerful GPUs. Many artificial intelligence and machine learning researchers now work on these complex models, which can learn and extract complex features. This led to the development and use of the first deep learning models. AlexNet [15], VGGNet [16], GoogleNet [17], ResNet [18] and their variants are very popular deep learning models. Owing to the vast improvement in visual recognition and detection, deep learning has achieved significant results across different object categories [19].

Earlier researchers faced problems in achieving satisfactory results for several reasons: samples of fish species were taken under unnatural conditions, and datasets were usually small. Recognition accuracy also degrades under varying environmental conditions.

In this paper, we propose a robust, automatic fish species classification system, intended especially for pond farming and for understanding fish habitats. The methodology is based on deep convolutional neural networks (D-CNN) and uses images from three different environments, as described in Sect. 3. The proposed model is a modified version of AlexNet that provides better results with fewer layers.

2 Convolutional Neural Networks

Deep learning is a field of machine learning that learns high-level abstractions in data by using hierarchical architectures. Moreover, as the number of layers increases, the data representation also improves [20]. In deep learning, the feature extractor and the classifier are trained at the same time. Feature extraction is carried out by the initial layers, which contain filter banks, nonlinear transformations and pooling layers, whereas classification is performed by the top layers, known as fully connected layers. Several object recognition systems use this mechanism for feature extraction and classification [21].

Convolutional neural networks (CNNs) are among the most prominent deep learning methods, in which multiple layers are trained and tested in a robust manner. A typical CNN consists of three main types of layers: convolutional layers, pooling layers, and fully connected layers. The convolutional layer extracts basic, local features. It is usually followed by a non-linear processing layer, which enables the network to capture the non-linearity present in the data [22]. The pooling layer reduces the size of the feature maps [23].
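
The three operations described above can be illustrated with a minimal pure-Python sketch. The toy 5 × 5 image, the 2 × 2 edge kernel and the function names are illustrative assumptions and are not taken from the proposed model; a real CNN learns its kernels during training.

```python
def conv2d(image, kernel):
    """Stride-1 convolution without padding (cross-correlation form)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Non-linear processing layer: negative responses are set to zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2, stride=2):
    """Pooling layer: keep the largest value in each window."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(fmap[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy image (assumption): dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written kernel (assumption) that responds to dark-to-bright edges.
kernel = [[-1, 1], [-1, 1]]

features = max_pool(relu(conv2d(image, kernel)))
print(features)  # [[2, 0], [2, 0]]
```

The 4 × 4 convolution output is reduced to a 2 × 2 feature map by pooling, and the edge response survives in the left column.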

CNNs can extract information about color, shape and texture even when a dataset has large variations in the backgrounds and the objects present in the images. Therefore, visual patterns can be readily learned by the network. The generalization capability of the network increases with the number of samples of a particular object. This generalization capability enables the network to classify data that it never saw during training [24]. AlexNet, VGGNet and ResNet are a few leading pre-trained deep convolutional neural networks. The use of these pre-trained networks is increasing, and they can be applied to many different applications.

2.1 AlexNet

AlexNet is one of the most popular deep CNNs used for visual recognition and classification applications. It was trained on 1.2 million labeled images of 1000 different object classes from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset. The AlexNet architecture contains 60 million parameters and 650,000 neurons [15]. It consists of five convolutional layers, three max-pooling layers, three fully-connected layers and a classifier layer as an output layer. The input layer takes images of size 227 × 227 × 3.

2.2 VGGNet

VGGNet [16] is another well-known deep CNN. The interesting feature of this architecture is its use of stacks of small convolutional kernels. The VGGNet architecture contains around 144 million parameters and includes 16 convolutional layers with very small (3 × 3) filters, five max-pooling layers, three fully-connected layers, and a classifier layer as an output layer. The depth of the network was increased by adding more convolutional layers, following the philosophy that deeper is better. Such deep networks are usually difficult to train and require a large amount of memory [25].
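
The benefit of stacking small kernels can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the channel count of 256 is an illustrative choice, not a value from this paper.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of a stack of stride-1 convolutions.

    Each additional layer widens the field by (kernel - 1) pixels.
    """
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels):
    """Weights of one convolution with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels


channels = 256  # illustrative assumption

# Three stacked 3 x 3 layers cover the same 7 x 7 region as one 7 x 7 layer.
print(stacked_receptive_field(3))         # 7
print(3 * conv_weights(3, channels))      # 1769472
print(conv_weights(7, channels))          # 3211264
```

Under these assumptions, the stack covers the same region with roughly 45% fewer weights and two extra non-linearities.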

3 Fish Dataset

The dataset used for this research work is taken from the QUT fish dataset [26]. It was used to make a comparison among the deep learning structures. This dataset was first used in [26] for a method named Local ISV. Local ISV is a classification method in which feature extraction, training and testing use different classes of data. For that reason, its performance cannot be compared directly with that of deep learning structures.

The QUT fish dataset contains 3960 images captured in different environments. The images are divided into three categories: “controlled”, “out-of-water” and “in situ”. The “controlled” images show several fish species photographed against a constant background. The “out-of-water” images are captured out of the water, without any background changes and under very limited illumination conditions. The “in situ” images are captured underwater, with the fish in their natural environment.

In this research study, six fish species were selected, with images taken in different conditions. The width and height of the images vary, so the images need to be scaled to the input size of the AlexNet model. Sample species are shown in Table 1. The LifeClef2015 Fish dataset is used for testing [27]. This dataset contains around 20,000 images divided into 15 different classes of fish species [27], with a different number of images available for each species. We selected the same six fish species used for training. The testing dataset was divided into testing and validation images. The testing set contains 20% of the images in each class. For validation, we randomly selected about 15% of the images in each class.
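
The per-class selection described above could be implemented along the following lines. This is a minimal sketch: the function name, the dictionary layout, the class labels, the file names and the fixed seed are all hypothetical, and only the 20% and 15% fractions come from the text.

```python
import random


def split_per_class(images_by_class, test_frac=0.20, val_frac=0.15, seed=0):
    """Randomly draw testing and validation images from every class.

    `images_by_class` maps a class label to a list of image identifiers.
    Images not drawn for testing or validation are returned under "rest".
    """
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    splits = {}
    for label, images in images_by_class.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_frac)
        n_val = round(len(shuffled) * val_frac)
        splits[label] = {
            "test": shuffled[:n_test],
            "val": shuffled[n_test:n_test + n_val],
            "rest": shuffled[n_test + n_val:],
        }
    return splits


# Hypothetical class labels and file names, for illustration only.
data = {
    "species_a": [f"a_{i}.jpg" for i in range(100)],
    "species_b": [f"b_{i}.jpg" for i in range(40)],
}
splits = split_per_class(data)
print(len(splits["species_a"]["test"]), len(splits["species_a"]["val"]))  # 20 15
```

Splitting each class separately keeps the class proportions the same in the testing and validation sets, which matters when the number of images per species differs.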

Table 1 Distribution of fish species for training, validation and testing along with sample images

4 The Proposed Model

The architecture of the model for underwater fish classification in fish farming is shown in Fig. 1. The proposed model is a simplified version of AlexNet [15]. AlexNet is preferred over other deep architectures because it has fewer layers and its training and validation accuracy exceeds 90%. We modified and reduced the original AlexNet to limit the computational complexity of training and testing, the number of parameters and the memory requirement. Deep convolutional neural networks have several advantages over traditional methods, especially radial basis function neural networks. Weight sharing in the convolutional layers reduces the number of parameters and makes it easier to detect edges, corners and blobs. The pooling layers provide invariance to changes in the position of the extracted features.

Fig. 1 Illustration of the proposed model for the identification and classification of fish species

The input layer takes an image of size 227 × 227 × 3 and passes it to the first convolutional layer, which has 96 feature maps and 11 × 11 kernels with a stride of four. The dimensions change to 55 × 55 × 96. After a non-linear activation function (ReLU) and a max-pooling layer with a 3 × 3 filter and a stride of two, the dimensions are reduced to 27 × 27 × 96. The second convolutional layer takes the output of the previous layer as input and has 256 feature maps and 5 × 5 kernels with a stride of one. Its output is again passed through a non-linearity (ReLU) and a max-pooling layer with a 3 × 3 filter and a stride of two, so the output is reduced to 13 × 13 × 256. The third and fourth convolutional layers are connected back to back, each with 3 × 3 filters and a stride of one. The third convolutional layer uses 384 feature maps and the fourth uses 256 feature maps, followed by a max-pooling layer with a 3 × 3 filter and a stride of two. The output of this pooling layer is flattened into a fully connected layer with 9216 units, which is connected to a fully connected layer with 4096 units. The output layer is a softmax layer with six units, one for each class in our dataset. The layers of the proposed model and their configurations are summarized in Table 2.
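
The layer dimensions quoted above follow from the standard output-size formula, floor((n + 2p − k) / s) + 1, for input size n, kernel size k, stride s and padding p. The text does not state the padding, so the sketch below assumes a padding of two for the second convolutional layer and one for the third and fourth, as in the original AlexNet; these are the values that reproduce the stated 13 × 13 × 256 and 9216 figures.

```python
def out_size(n, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding, feature maps).
# Padding values are assumptions; they are not given in the text.
LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, 96),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, 256),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, 256),
]


def trace_shapes(input_size=227):
    """Return (name, height, width, channels) after every layer."""
    shapes = []
    size = input_size
    for name, kernel, stride, padding, channels in LAYERS:
        size = out_size(size, kernel, stride, padding)
        shapes.append((name, size, size, channels))
    return shapes


shapes = trace_shapes()
for shape in shapes:
    print(shape)

_, height, width, channels = shapes[-1]
print(height * width * channels)  # 9216
```

Under these assumptions the final pooling layer outputs 6 × 6 × 256, which flattens to the 9216 units mentioned above.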

Table 2 Summary of proposed model with layers and its configurations

The proposed model was built and implemented in TensorFlow. The learning rate was set to 0.001 and the weight decay to 0.0002.

5 Experimental Results

The proposed model was tested and evaluated on a benchmark fish dataset that was not used for training (LifeClef’15). Data augmentation is also applied where very few testing images are available. The augmentation techniques used are rotation, flipping and zooming. The comparison was made against two deep learning models, namely AlexNet and VGGNet. The parameters used for comparing the models were the number of convolutional layers, the number of fully-connected layers, the number of iterations, the number of batches and the inclusion of a dropout layer. Table 3 compares the results with those of the other deep learning models.
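
The three augmentation operations can be sketched on a toy image stored as a nested list. The rotation angle (90°), the zoom factor (2) and the nearest-neighbour resampling are assumptions made for illustration; the text does not state the settings actually used.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def zoom_center(img, factor=2):
    """Crop the central region and scale it back to the original size
    using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    crop_h, crop_w = h // factor, w // factor
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    return [
        [img[top + i * crop_h // h][left + j * crop_w // w] for j in range(w)]
        for i in range(h)
    ]


# Toy 4 x 4 single-channel image (assumption).
img = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

augmented = [flip_horizontal(img), rotate_90(img), zoom_center(img)]
print(augmented[2])  # [[5, 5, 6, 6], [5, 5, 6, 6], [9, 9, 10, 10], [9, 9, 10, 10]]
```

Each transform yields a new sample with the same label, which is how augmentation enlarges classes that have few images.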

Table 3 Comparison of results based on other deep learning models and our proposed model

The first parameter used for comparison is the number of convolutional and fully-connected layers. Using fewer of these layers than the other deep learning models means lower computational cost for training, validation and testing, and lower memory use. The second parameter is the number of training iterations needed for the model to reach 100% training accuracy. Our proposed model needed a large number of iterations to reach 100% training accuracy, which was expected because, unlike the other architectures, it had not been trained before. The third parameter is the number of batches, expressed here as the batch size. With a batch size of 10, our proposed system achieved validation and testing accuracies of 94.23% and 87.35%, respectively. Increasing the batch size to 20 raised the validation and testing accuracies to 96.28% and 88.52%, respectively. The fourth parameter is the inclusion of a dropout layer. Including a dropout layer before the softmax classifier layer raised the validation and testing accuracies of our model to 98.2% and 90.48%, respectively. The proposed model thus achieved its best testing accuracy of 90.48%, which exceeds the 86.65% achieved by AlexNet. Figure 2 illustrates the validation and testing accuracy of our proposed work as the batch size is increased and dropout is included.
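
To make the role of the dropout layer concrete, the following sketch applies inverted dropout to a small feature vector before a six-unit softmax classifier. The dropout rate of 0.5, the toy feature values and the weights are assumptions chosen for illustration; the text does not report the rate used in the proposed model.

```python
import math
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and rescale the remaining units by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def dense(inputs, weights, bias):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
features = [0.5, 1.2, 0.0, 2.0]  # toy activations (assumption)
weights = [[0.1 * (i + j) for j in range(4)] for i in range(6)]  # toy weights
bias = [0.0] * 6

probs = softmax(dense(dropout(features, 0.5, rng), weights, bias))
print(len(probs), round(sum(probs), 6))  # 6 1.0
```

At test time the layer is called with `training=False` and passes the features through unchanged, so no rescaling is needed during inference.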

Fig. 2 Validation and testing accuracies of our proposed work

The improvement is not large, which is expected because all the other deep models were pre-trained, whereas the weights of our model had to be learned over hours of training. The proposed system does not outperform the validation accuracy of VGGNet, but it performs better than AlexNet in testing accuracy. It should be noted that the presented model uses fewer layers, which makes it less computationally complex. Furthermore, it is designed to identify freshwater fish species with less classification time, which makes it more suitable for real-time fish species applications.

6 Conclusion

In this paper, we proposed an automatic classification system, based on deep convolutional neural networks, for fish species present in freshwater. It will help researchers and marine biologists understand the life cycles and habitats of underwater species, making it easier to farm fish in ponds. It will also help fish stockists and marine park managers conserve fish species. We introduced a simpler version of AlexNet comprising four convolutional layers and two fully-connected layers. A comparative study against other deep models shows that our proposed model was unable to outperform VGGNet in validation and testing accuracy. VGGNet is a very large and deep neural network that has already been trained on millions of images, whereas our proposed system has only 1344 images. Including a dropout layer before the softmax classifier greatly improves the performance of the model. The proposed model outperforms AlexNet in testing accuracy: it achieves 90.48% while AlexNet achieves 86.65%, and it does so with fewer training images, less computation and less memory. In future, this work can be extended to underwater species facing environmental challenges, water turbidity and background confusion. We can also further improve our classification method for real-time monitoring of underwater fish species.