1 Introduction

In the last decades, research on object recognition has mainly focused on distinguishing object classes which are visually dissimilar [8, 10]. Recently, computer vision researchers have put considerable effort into recognizing subordinate object classes, a.k.a. fine-grained object recognition, a task that poses several challenges because of the lack of strongly-discriminative features.

Automated visual systems performing such tasks might provide significant support to many applications, especially those requiring specialized domain knowledge (e.g. ecology): indeed, most people can easily distinguish a person playing a clarinet from one merely holding it [18], while it is much more difficult to distinguish between plant types or animal species, where inter-class similarity might be very high.

Fig. 1 From left to right, examples of Acanthurus nigrofuscus, Chromis margaritifer and Dascyllus reticulatus fish species: how they appear in an encyclopedia (first row) and how they appear in our dataset (second row)

Moreover, especially for the ecological context, the need for such automatic tools has become even greater due to technological advances (remotely-operated vehicles, stationary non-invasive cameras, as well as the overall reduction of device costs) which led to the collection of massive datasets, whose analysis requires automated methods as it cannot be done by human operators [3].

However, to perform automatic or semi-automatic visual tasks, it is first necessary to have large annotated datasets, whose collection requires human efforts, specialized domain knowledge (especially for fine-grained object recognition) and often money, as shown by several recent methods exploiting platforms such as Amazon Mechanical Turk.

The proposed work addresses both the problem of automatically annotating domain-specific datasets and that of performing fine-grained visual tasks on them. In particular, we focus on underwater visual data, using a large fish image dataset built by processing [28] over one million 10-minute video clips recorded for fish biodiversity monitoring within the Fish4Knowledge project.

The difficulty of fish-species classification in our dataset lies in the fact that fish species, especially within the same family, differ by only a few phenotypic features, and even images of different fish species may look very similar because of 1) the high deformability of fish, 2) light propagation in water, and 3) the low resolution of the recorded videos (due to network limitations in the monitoring stations). Fig. 1 shows three samples of the Acanthurus nigrofuscus, Chromis margaritifer and Dascyllus reticulatus fish species, comparing how they appear in high-resolution images and in our dataset.

The main contributions of our work are:

  • Two fine-grained object classification approaches operating, respectively, on still images and on videos;

  • The introduction and release of a new fine-grained fish image dataset, which complements the existing fine-grained recognition benchmarks [26, 37] and provides a useful basis to support the investigation of marine life and its biodiversity.

The remainder of the paper is organized as follows: in Section 2 a review of the most recent fine-grained object classification methods is carried out. Section 3 describes the employed underwater visual dataset, while the fine-grained object recognition methods are reported in Section 4, with their performance described and discussed in Section 5. Concluding remarks and future developments are given in Section 6.

2 Related work

Recently, the computer vision, machine learning and multimedia scientific communities have addressed with increasing interest the problem of fine-grained recognition, i.e. categorizing objects which look very similar and differ only in subtle details, such as recognizing different species of animals (e.g. dogs [26], birds [5]) and plants [19]. Of course, this task is rather challenging, because the discriminative features among the object classes are more difficult to identify; besides, one of the consequences of the novelty (and the complexity) of this task is the lack of datasets to be used for training and testing machine learning approaches. One example of fine-grained recognition is fish identification at the species level. This task, per se, is not more difficult than other animal-based recognition tasks, but its application in a “real-life” context makes it very challenging because of the high variability of fish appearance and the limitations of the employed underwater imaging devices [12, 13].

Several techniques have been specifically devised to deal with fine-grained recognition [5, 9, 17, 35, 36], mostly focusing on the discovery of visual features which are more discriminative at the subordinate level. In [17], a combination of visual cues extracted from training images is used to build discriminative compound words. In [7], image patches are considered as discriminative attributes and a Conditional Random Field (CRF) framework is used to learn the attributes on a training set with humans in the loop.

Dense-sampling techniques have also been explored, which decompose an image into patches and then extract low-level features from these regions to identify fine image statistics. In [38] the authors introduce “Grouplet”, a set of generative local dense features which work reliably for human activity recognition. As a follow-up, the same authors, in [18], improved the performance of their former method by fusing global and local information and using a random forest–based approach to extract dense spatial information. The main limitation of these methods is their low efficiency, as dense-sampled feature spaces are often huge and increase many-fold when employing multiple image patches of arbitrary size. In addition, they are not able to operate on low-quality images, where the finer details necessary to capture the subtle differences between fine-grained object classes are missing.

An added difficulty for the existing methods is the lack of extensive training data: for example, the Caltech-UCSD Birds (CUB-200) [5] dataset contains only 6000 images of 200 different bird species, i.e. 30 images per bird species on average. It has been shown [33] that large training datasets allow for effective nonparametric object recognition methods. However, the manual collection of large-scale annotated datasets is a complex, tedious and expensive task, and although several techniques have been proposed for automatic labeling in the case of basic object recognition [14], the line of research that seems to be more effective for fine-grained recognition still involves human operators in the annotation loop: fine-grained labels are much more difficult to acquire, and the automatic selection of discriminative features often results in the selection of irrelevant ones. A promising direction is resorting to crowdsourcing solutions, as in [6], whose application has proved to be effective in discriminative feature selection, avoiding overfitting issues. However, in the case of fine-grained object recognition, the annotation process may require a deep understanding of the specific domain by the human operator (it is not feasible to ask non-experts to distinguish fish, birds or plants of the same family), thus limiting the applicability of crowdsourcing approaches.

3 Dataset

3.1 Underwater image and video dataset

For the experiments described in this paper, we employed the following datasets:

  • The U-20M dataset contains about 20 million unannotated (hence the U in the name) underwater images. This dataset represents the main body of images employed in our experiments.

  • The MA-35K dataset is a subset of U-20M containing about 35,000 fish images belonging to 10 chosen species. Each image was manually annotated (MA) with the corresponding species.

  • The AA-1M dataset contains about 1 million automatically-annotated (AA) images. This dataset is a subset of the images in U-20M which belong to the classes annotated in MA-35K.

  • The FishCLEF-31K dataset consists of about 30,000 images extracted from AA-1M and was used for the fish task [29] of the LifeCLEF 2014 benchmarking initiative [16]. Together with the fish images, the videos these images were extracted from (about 400 10-minute video clips) were also released.

3.2 Dataset collection

The U-20M dataset represents a randomly-selected fraction of the data collected in the context of the Fish4Knowledge project, which amounts to more than a billion fish images extracted by processing over one million videos. These images generally contain one fish each and correspond to the fish bounding boxes detected by background modeling approaches [1]. The number of fish species in U-20M is unknown, as the dataset is unlabeled; however, during the Fish4Knowledge project it was noted that 99.9 % of the observed fish belong to only 10 species (and for the remaining species it was very difficult to gather a significant number of image samples).

To create the U-20M dataset we first selected, from the one-billion-image dataset, only one image per trajectory (extracted via object tracking [30]) in order to avoid near-duplicate images. On average, each trajectory consisted of about ten fish detections, and by selecting only one image per trajectory we reduced the original dataset from one billion to about 150M images. The U-20M dataset was then generated by randomly (uniform pdf) selecting 20M images from this set. It is also important to notice that the random selection keeps the fish species distribution approximately unchanged.

The MA-35K dataset is a subset of the U-20M dataset, containing only (but not all) fish images belonging to the 10 most common species. Image annotation was carried out manually and validated by two expert marine biologists. In the original dataset some species were more common than others, exhibiting the typical long-tail distribution; although we tried to make the dataset as uniformly distributed across species as possible, for some of them (most evidently, Lutjanus fulvus) it was quite difficult to find a large number of adequate images, which resulted in a lower presence in the dataset. Figure 2 shows sample pictures of the chosen 10 species.

Fig. 2 The 10 fish species analyzed in this work (images taken directly from the MA-35K dataset). From left to right and top to bottom: Acanthurus nigrofuscus, Amphiprion clarkii, Chaetodon lunulatus, Chromis margaritifer, Dascyllus reticulatus, Hemigymnus fasciatus, Lutjanus fulvus, Myripristis kuntee, Neoniphon sammara, Plectrogly-Phidodon dickii

The AA-1M dataset is also a subset of U-20M, obtained by the semi-automatic annotation approach described in [11]: briefly, images from MA-35K are used as queries for a similarity-based search in U-20M; after a check for false positives, performed by means of the mutual similarity among the retrieved images, the resulting images are assigned the same species label as the query image.
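
The following is a minimal sketch of such a similarity-based label propagation step, written in Python with NumPy. The feature representation, the number of retrieved images and the mutual-similarity threshold are illustrative assumptions and do not reproduce the exact procedure of [11].

```python
# A minimal sketch of similarity-based label propagation used to build an
# automatically-annotated set; thresholds and the mutual-similarity test are
# assumptions, not the exact procedure of [11].
import numpy as np

def propagate_labels(query_feats, query_labels, pool_feats, k=50, mutual_thr=0.8):
    """Assign each query's label to its k most similar pool images,
    keeping only retrievals that are mutually similar (false-positive check)."""
    # L2-normalise so that dot products are cosine similarities
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    p = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
    labels = -np.ones(len(p), dtype=int)          # -1 = still unlabeled
    for feat, lab in zip(q, query_labels):
        sims = p @ feat                            # similarity to every pool image
        top = np.argsort(sims)[::-1][:k]           # k best matches
        # mutual-similarity check: retrieved images must also resemble each other
        mutual = (p[top] @ p[top].T).mean()
        if mutual >= mutual_thr:
            labels[top] = lab
    return labels
```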

The FishCLEF-31K dataset was obtained by randomly selecting about 31K images from AA-1M. We did not use the AA-1M dataset directly, since training classifiers on such a big dataset is impractical and computationally expensive, nor the MA-35K one, because it contains many near-duplicate images. Table 1 shows the distribution of the images for the ten fish species and its partitioning into training and test sets for the LifeCLEF 2014 contest. For video-based fish species recognition, we employed the videos the 31K images derive from, i.e. 401 10-minute videos.

Table 1 Distribution of fish species in the dataset used for the fish task of the LifeCLEF 2014 contest

All the datasets used in this paper are publicly available at http://www.perceive.dieei.unict.it/datasets/fish_recognition.

4 Fine-grained fish species recognition

In this section we present two approaches for fish species recognition in still images and in videos. In the case of videos, before classification, fish identification was performed by means of a background modeling approach.

4.1 Image-based fish species identification

Unlike other fish-recognition approaches [12, 13], our fine-grained fish recognition performs a multi-scale analysis: from a fast one-layer encoding-pooling scheme based on MB-LTP, robust to illumination/color variations, to a fine-grained analysis via Fisher vectors and sparse coding trained on SIFT local features. The multi-scale aggregation via late fusion allows a high classification rate to be reached (see Section 5). More in detail, our method is based on the classic unsupervised pipeline (see [20, 24, 25]), which consists of the following three steps:

  • local feature extraction

  • patch encoding

  • pooling operation (possibly with a spatial pyramid to improve local analysis)

These three steps constitute a layer. Layers can be stacked together to obtain a more global representation. On the one hand, the more layers are stacked, the more the recognition system is invariant to complex image transformations; on the other hand, if the number of layers is high, discriminability between classes can be lost. In practice, the number of layers is tuned to capture the main image transformations common to all classes. On top of this architecture, large-scale supervised classification is used, typically via linear SVM or logistic regression algorithms. In most computer vision approaches, only the encoding part is trained, in an unsupervised way, from a random subsample of local features, while the pooling parameters are fixed ad hoc. In our work, we fixed the spatial pyramid of the last layer to 1×1+2×2, representing a total of 1+4 ROIs over which codes are pooled. In most cases, the first layer is not trained but fixed according to some neuro-vision considerations. For example, SIFT patches [21] can be seen as a fixed sparse-coding scheme of local gradient orientations, where the sparsity level is fixed to 2 (see [4]), followed by a weighted pooling over local 4×4 windows. Local Binary Patterns [23] can also be regarded as the sparse coding of local binary patterns, where the sparsity level is fixed to 1 and pooling is performed by histogram computation. For image-based fish recognition, we devised three approaches: the first one using a 1-layer architecture, and the other two exploiting a 2-layer hybrid architecture.
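
As a rough illustration of a single extraction-encoding-pooling layer, the following Python sketch applies a generic encoder to pre-extracted local patches and pools the resulting codes over the 1×1+2×2 spatial pyramid; the encoder is deliberately left abstract, since in our pipelines it corresponds to LTP histogramming, Fisher vectors or sparse coding.

```python
# A schematic (hypothetical) rendering of one extraction-encoding-pooling layer
# with the 1x1 + 2x2 spatial pyramid used for the last layer; the encoder is
# left abstract.
import numpy as np

def spatial_pyramid_pool(codes, xy, img_w, img_h):
    """Average-pool local codes over 1 global ROI + 4 quadrants (1x1 + 2x2)."""
    rois = [(0, 0, img_w, img_h),
            (0, 0, img_w / 2, img_h / 2), (img_w / 2, 0, img_w, img_h / 2),
            (0, img_h / 2, img_w / 2, img_h), (img_w / 2, img_h / 2, img_w, img_h)]
    pooled = []
    for x0, y0, x1, y1 in rois:
        mask = (xy[:, 0] >= x0) & (xy[:, 0] < x1) & (xy[:, 1] >= y0) & (xy[:, 1] < y1)
        pooled.append(codes[mask].mean(axis=0) if mask.any() else np.zeros(codes.shape[1]))
    return np.concatenate(pooled)                  # final image descriptor

def layer(patches, xy, encoder, img_w, img_h):
    """One layer: encode each local patch, then pool over the spatial pyramid."""
    codes = np.stack([encoder(p) for p in patches])
    return spatial_pyramid_pool(codes, xy, img_w, img_h)
```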

4.1.1 Direct 1-layer approach

The first approach (corresponding to the first run in the results presented in the next section) is associated with a 1-layer architecture based on the approximated Local Ternary Patterns (LTP) proposed by Tan et al. (see [32]). In many computer vision applications, especially face detection/recognition, Local Binary Patterns (LBP) and their derivatives are known to offer very good results at a modest computational cost. In [32], LTP codes are approximated by the aggregation of two LBP codes. We extended the direct LTP formulation to a multi-scale version, where binary codes are computed over blocks of s×s pixels instead of a single pixel. We performed the analysis with three block sizes s∈{1,2,3}. We also compute LTP codes for each RGB color channel. The total feature size is 2×256×3×3×5=23,040. As mentioned earlier, this architecture is fixed and no training phase is required. The processing takes less than 0.05 s per image on a modern laptop.
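
The following sketch illustrates, under simplifying assumptions, how such an approximated multi-scale LTP descriptor can be computed: each LTP code is split into two LBP codes (its positive and negative halves, as in [32]) and histograms are accumulated per block size and per RGB channel. The threshold value is illustrative, and the spatial-pyramid pooling described above is omitted for brevity.

```python
# A hedged sketch of the approximated multi-scale LTP descriptor.
import numpy as np

def ltp_histograms(channel, s=1, t=5):
    """Approximated LTP on one channel with s x s averaging blocks and threshold t."""
    h, w = channel.shape
    # average over s x s blocks to obtain the multi-scale version
    c = channel[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s).mean(axis=(1, 3))
    center = c[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    upper = np.zeros_like(center, dtype=np.int32)   # LBP code of the +1 half
    lower = np.zeros_like(center, dtype=np.int32)   # LBP code of the -1 half
    for bit, (dy, dx) in enumerate(offsets):
        neigh = c[1 + dy:c.shape[0] - 1 + dy, 1 + dx:c.shape[1] - 1 + dx]
        upper |= ((neigh >= center + t).astype(np.int32) << bit)
        lower |= ((neigh <= center - t).astype(np.int32) << bit)
    return (np.bincount(upper.ravel(), minlength=256),
            np.bincount(lower.ravel(), minlength=256))

def multiscale_ltp(rgb_image):
    """Concatenate the two 256-bin histograms over 3 scales and 3 channels."""
    feats = []
    for s in (1, 2, 3):
        for ch in range(3):
            feats.extend(ltp_histograms(rgb_image[:, :, ch].astype(np.float32), s=s))
    return np.concatenate(feats)   # 2 x 256 x 3 x 3 = 4608 dims per ROI (x5 ROIs = 23,040)
```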

4.1.2 Stacked 2-layer approach

The second and third approaches are based on a 2-layer hybrid architecture, where the first layer employs densely grid-sampled SIFT or LTP patches and the second layer uses the Fisher Vector (FV) [27] or Sparse Coding (SC) [39] framework as encoder.

In particular, the second method consists of a fusion between the approach described above and Fisher Vector representations. Specifically, we sampled N_x×N_y=25×25=625 SIFT patches per scale and per color channel, each computed over an M=24-pixel square block. Three scales are used, σ∈{0.5,0.75,1}, over the three RGB color channels. In order to satisfy the diagonal assumption on the inverse of the Fisher Information Matrix (FIM), a PCA is performed on the extracted SIFT patches for each scale and each color channel, reducing the dimension from 128 to 80. Following [27], we appended to each patch the normalized coordinates of its center. The FV encoding consists in computing the gradient of the log-likelihood with respect to the means and variances of a Gaussian Mixture Model. We fixed G=32 Gaussians and used 300,000 local patches to train the GMM. The global FV representation is obtained by average pooling of the local representations over each window of the spatial pyramid, for each scale and color channel. The total feature size is (80+2)×2×32×3×3×5=236,160. Late fusion is performed by averaging the posterior probabilities obtained during linear SVM training.
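
A simplified rendering of the Fisher Vector encoding under the diagonal-covariance assumption is sketched below, using scikit-learn to train the GMM; the PCA step, the per-scale/per-channel processing and the usual normalisations are omitted, and the parameter names are illustrative.

```python
# A simplified Fisher Vector encoder: gradients of the log-likelihood w.r.t.
# the GMM means and (diagonal) variances, as in [27]; normalisations omitted.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(local_descriptors, n_gaussians=32, seed=0):
    gmm = GaussianMixture(n_components=n_gaussians, covariance_type='diag',
                          random_state=seed)
    gmm.fit(local_descriptors)                      # e.g. 300,000 reduced SIFT patches
    return gmm

def fisher_vector(descriptors, gmm):
    """Encode a set of local descriptors (N x D) into a 2*G*D Fisher Vector."""
    N, D = descriptors.shape
    q = gmm.predict_proba(descriptors)              # soft assignments, N x G
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    fv_mu, fv_var = [], []
    for g in range(gmm.n_components):
        diff = (descriptors - mu[g]) / np.sqrt(var[g])
        qg = q[:, g:g + 1]
        # gradient w.r.t. the mean of Gaussian g
        fv_mu.append((qg * diff).sum(axis=0) / (N * np.sqrt(pi[g])))
        # gradient w.r.t. the diagonal variance of Gaussian g
        fv_var.append((qg * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * pi[g])))
    return np.concatenate(fv_mu + fv_var)           # length 2 * G * D
```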

The last method instead fuses the LTP, FV and SC representations. For SC, LTP patches are chosen instead of SIFT ones, given their better representation capacity (see [25]). LTP patches are extracted densely, in the same way as the SIFT ones, over ROIs of 24×24 pixels, for 3 different scales and the 3 RGB channels. Dictionaries for SC are trained for each scale and color channel. We used the classic dictionary learning procedure, alternating dictionary codebook optimization with sparse code updates. We employed a block-coordinate descent approach (see [22]) for the codebook optimization, adding an extra positivity constraint on each dictionary element. Sparse codes are updated by the LARS algorithm, given the current estimated dictionary, with an extra positivity constraint on the sparse codes as well. The number of dictionary elements was fixed to K=1,024, and 300,000 local features were used to train each dictionary. The total feature size is 1,024×3×3×5=46,080. For the pooling stage, we used p-norm pooling with p=3, which proved to offer better results than average pooling (p=1) and max pooling (p=∞).
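
The sketch below gives a hedged approximation of the sparse-coding branch: it relies on scikit-learn's dictionary learning with positivity constraints (rather than the exact block-coordinate descent / LARS procedure of [22]) and shows the p-norm pooling with p=3.

```python
# A hedged sketch of the sparse-coding branch: non-negative dictionary learning
# (via scikit-learn as a stand-in for the procedure of [22]) and p-norm pooling.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def learn_dictionary(ltp_patches, n_atoms=1024, seed=0):
    """Train a non-negative dictionary on local LTP patches (one per scale/channel)."""
    dl = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=1.0,
                                     fit_algorithm='cd',
                                     transform_algorithm='lasso_lars',
                                     positive_dict=True, positive_code=True,
                                     random_state=seed)
    dl.fit(ltp_patches)                              # e.g. 300,000 local features
    return dl

def p_norm_pool(sparse_codes, p=3):
    """Pool local sparse codes over an ROI; p=1 is average pooling, p->inf is max."""
    return (np.abs(sparse_codes) ** p).mean(axis=0) ** (1.0 / p)
```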

Figure 3 describes the pipelines employed for the image-based fish species recognition task.

Fig. 3 Summary of the approaches devised for fish species recognition: 1) Approach 1 provides a single pipeline result (light blue block), 2) Approach 2 aggregates two pipeline results by late fusion (light blue + purple blocks) and 3) Approach 3 aggregates three pipeline results (light blue + purple + green blocks) by late fusion

4.2 Video-based fish species identification

Our video-based fish identification consists of two steps: 1) key-points are extracted from the fish images provided in the training set in an off-line fashion, and 2) the precomputed key-points are matched against dense key-points extracted from candidate fish (extracted by a background modeling approach) for fish species classification. The flowchart of this step is shown in Fig. 4.

Fig. 4 The video-based fish identification flowchart: on-line module

Off-line extraction of key-points from fish images

For each video frame in the training set, we compute three groups of key-points using the Opponent SIFT color descriptor [40] at different scales, placed on the central horizontal axis at 1/3, 1/2 and 2/3 of the horizontal length. Scales are computed starting from the fish bounding box size (provided in the ground truth), suitably increased or decreased so as to exactly contain the whole fish and its main parts, i.e. head, body and tail. We adopt this strategy of hand-crafted large key-point descriptors owing to the low resolution of the videos and because classical detectors generally yield smaller key-points. The detection of key-points is described in Fig. 5.
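
A minimal sketch of this key-point placement is given below; the scale factors used to derive the key-point sizes from the bounding box are illustrative assumptions, and plain SIFT per opponent colour channel is only suggested as a stand-in for the Opponent SIFT descriptor.

```python
# A minimal sketch of the off-line key-point placement: three key-points on the
# central horizontal axis at 1/3, 1/2 and 2/3 of the bounding-box width, with
# sizes derived from the box (the scale factors here are illustrative).
import cv2

def fish_keypoints(bbox):
    """bbox = (x, y, w, h) of the fish in the frame -> list of cv2.KeyPoint."""
    x, y, w, h = bbox
    cy = y + h / 2.0                                 # central horizontal axis
    positions = [x + w / 3.0, x + w / 2.0, x + 2.0 * w / 3.0]
    sizes = [0.6 * min(w, h), 1.0 * min(w, h), 0.6 * min(w, h)]   # head, body, tail
    return [cv2.KeyPoint(float(px), float(cy), float(s))
            for px, s in zip(positions, sizes)]

# Descriptors would then be computed on these key-points, e.g. SIFT per opponent
# colour channel as a stand-in for the Opponent SIFT descriptor of [40]:
# sift = cv2.SIFT_create(); _, desc = sift.compute(channel_image, fish_keypoints(bbox))
```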

Fig. 5 Key-point detection in fish images

On-line video-based fish species identification

The first step of our online module detects moving objects from the background/foreground segmentation maps obtained by the adaptive background mixture model described in [40]. These masks are then post-processed with morphological operators: a circular erosion of radius 3 followed by a circular dilation of the same radius.
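
This post-processing step can be expressed directly with OpenCV, as in the following sketch (a radius-3 disc corresponds to a 7×7 elliptical structuring element):

```python
# Morphological clean-up of the foreground masks: a circular erosion of radius 3
# followed by a circular dilation of the same radius.
import cv2

def clean_mask(fg_mask):
    """fg_mask: binary uint8 foreground mask from the background model."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))  # radius-3 disc
    eroded = cv2.erode(fg_mask, kernel)
    return cv2.dilate(eroded, kernel)
```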

In the training phase, starting from the background/foreground masks, we build bounding boxes for the detected fish and then train one SVM classifier for each fish species, using the descriptors of the detected fish as positive samples and Opponent SIFT descriptors of the background as negative samples. Specifically, key-points are densely extracted from the background with fixed scales from 30 to 110 pixels of diameter, with respect to the size of the video. In order to prevent background key-points (due to large bounding boxes) from being used for SVM training, we filter the positive key-points: for each key-point inside a bounding box, we look for its 10 nearest neighbors and keep the best key-points (those with more positive neighbors) so that 50 percent of the extracted descriptors are positive.
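
A hedged sketch of this filtering step follows; it assumes that the nearest neighbours are searched in descriptor space among all training descriptors, which is an interpretation of the procedure rather than its exact implementation.

```python
# A hedged sketch of the positive key-point filtering: candidate positive
# descriptors are ranked by how many of their 10 nearest neighbours (here in
# descriptor space, which is an assumption) are themselves positive, and the
# best ones are kept so that positives amount to at most 50% of the training set.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def filter_positives(pos_desc, neg_desc, k=10):
    all_desc = np.vstack([pos_desc, neg_desc])
    is_pos = np.array([True] * len(pos_desc) + [False] * len(neg_desc))
    nn = NearestNeighbors(n_neighbors=k + 1).fit(all_desc)
    _, idx = nn.kneighbors(pos_desc)                 # +1 because each point finds itself
    pos_votes = is_pos[idx[:, 1:]].sum(axis=1)       # positive neighbours per key-point
    order = np.argsort(pos_votes)[::-1]              # best key-points first
    n_keep = min(len(pos_desc), len(neg_desc))       # keep positives <= 50% of the total
    return pos_desc[order[:n_keep]]
```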

In the test phase, for each blob detected in a video, key-points are densely extracted at several scales (fixed scales from 30 to 110), and likelihood scores are computed as the distances between the descriptors and the SVM decision boundary of each fish class. Only positively classified points with a distance larger than a threshold T are considered, and the scores associated with each fish species are summed up in order to obtain a global likelihood score per species per blob: the fish species with the highest score is then assigned to the blob.
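
The per-blob scoring rule can be summarised by the following sketch, where each species has its own binary SVM and T is the distance threshold mentioned above (set to 0.5 in Section 5):

```python
# A sketch of the per-blob scoring rule: descriptors whose distance from a species'
# SVM boundary exceeds the threshold T contribute to that species' score, and the
# blob is assigned the species with the highest total score.
import numpy as np

def classify_blob(blob_descriptors, species_svms, T=0.5):
    """species_svms: dict species_name -> trained binary scikit-learn SVM."""
    scores = {}
    for species, svm in species_svms.items():
        d = svm.decision_function(blob_descriptors)  # signed distance to the boundary
        scores[species] = d[d > T].sum()             # keep confident positives only
    return max(scores, key=scores.get), scores
```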

5 Performance evaluation

The approaches described in Section 4 were tested on the FishCLEF-31K dataset within the fish task of the LifeCLEF 2014 benchmarking initiative. In particular, the LifeCLEF 2014 fish task (a.k.a. FishCLEF 2014) aimed at benchmarking automatic fish detection and recognition methods processing underwater visual data. It basically consisted of two tasks:

A video-based task – detecting fish instances in key video frames and recognizing their species;

An image-based task – identifying fish species using still images containing a single fish instance.

Baseline

In order to provide a baseline for our fish species classification methods, we tested 1) the VLFeat BoW [34] classification method (generally used as a baseline for fine-grained recognition tasks [26]) on our FishCLEF-31K dataset, for the task of recognizing fish in still images, and 2) the ViBe background modeling approach [1] for fish detection in videos (it proved to operate effectively on underwater videos [3]), followed by VLFeat + BoW for fish species recognition. In both cases, performance evaluation was carried out by computing the average precision and recall, as well as the precision and recall for each fish species; the results are shown in Table 2. In particular, for the image-based task only the recall was assessed (since precision was one for all the considered species), while for the video-based task we also computed the precision, accounting for the probability of identifying background areas as fish. When computing precision and recall for both tasks, we took into account only the most probable class for each image.
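
For reference, the per-species figures can be derived as in the following sketch, where only the most probable class per image is kept and precision/recall are computed with scikit-learn; label names are placeholders.

```python
# A sketch of the per-species evaluation: only the most probable class per image
# is kept, and precision/recall are derived per species with scikit-learn.
from sklearn.metrics import precision_recall_fscore_support

def per_species_scores(true_labels, predicted_labels, species_list):
    p, r, _, _ = precision_recall_fscore_support(true_labels, predicted_labels,
                                                 labels=species_list, zero_division=0)
    return dict(zip(species_list, zip(p, r)))
```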

Table 2 Fine grained classification baselines on the FishCLEF 31K dataset

Image-based fish species recognition [15]

This section reports the performance achieved by the three methods described in Section 4.1 on the image-based task. Each method employed different features, namely: 1) only LTP, 2) LTP + Improved Fisher Vectors (IFV) and 3) LTP + IFV + SC. For all three methods, the final global representation is pooled on a 1×1+2×2 spatial pyramid. The results of the three approaches, in terms of recall, are reported and compared to our baseline in Fig. 6. As for precision, the approaches yielded a precision of 1 for almost all species, except for Chromis margaritifer (0.96), Dascyllus reticulatus (0.97) and Plectrogly-Phidodon dickii (0.98). From the recall performance, it is possible to notice that the fast LTP approach (1-layer architecture) outperforms the baseline by a large margin (except for the first species). The other two approaches (2 and 3) only slightly improved the results, because few examples were misclassified by the first one.

Fig. 6 Image-based fish species recognition results [15] in terms of recall on the FishCLEF 31K dataset

Video-based fish species recognition [2]

The results, in terms of precision and recall, achieved by the video-based fish species recognition approach described in Section 4.2 are shown in Figs. 7 and 8. For this task, the approach first discriminates fish from background and then assigns a species to each detected fish. A detection was considered a true positive if the PASCAL score between it and the corresponding object in the ground truth was over 0.3. The employed parameters were T=0.5 and M=10 (see Section 4.2). Also for this task, we used three different settings based on how the blobs detected by the fish detection module were treated: 1) blobs used as they come out of the background/foreground segmentation mask; 2) fish occlusions (more than one fish in a bounding box) separated by resorting to color features; and 3) small blobs with spatial and color coherence merged together.

Fig. 7 Precision of our video-based fish species recognition results [2] on the FishCLEF 31K dataset

Fig. 8 Recall of our video-based fish species recognition results [2] on the FishCLEF 31K dataset

While the average recall obtained by our method was lower than the baseline's, the precision was much higher, thus implying that our key-point-based classification approach was more reliable than the fish detection baseline [1]. The reason behind the low recall may be found in the size of the bounding boxes computed when processing a video (see Fig. 9): bigger bounding boxes may also contain background objects and/or other fish instances, thus affecting the performance of the species classification method.

Fig. 9 Matching between the bounding boxes computed by our video-based fish identification method [2] (in red) and the ones (in green) provided in the FishCLEF 31K dataset

To demonstrate the effectiveness of this method in discriminating fish species, we adapted it to work with still images and compared its performance to those achieved by, respectively, the method devised for the image-based task and the baseline. The results are shown in Table 3: the approach employed for the image-based task is still the best performing one, followed by the one developed for the video-based task, which was able to outperform a powerful approach such as VLFeat + BoW. However, the approach based on key-point extraction is more efficient, though less accurate, than the approaches based on LTP, and this makes it particularly suitable for applications where efficiency is a key requirement.

Table 3 Overall fine grained classification accuracy

6 Concluding remarks

In this paper we have introduced a large dataset for the fine-grained recognition problem of identifying fish species in images and videos. We have developed an effective nonparametric approach for automatic label propagation. The automatically-labeled dataset (suitably filtered) was used for benchmarking fish species recognition approaches within the fish task of the LifeCLEF 2014 initiative. Two methods for fish species recognition were also described: the first one dealing with still images and the second one with videos. Both methods achieved very high performance, although dealing with videos is much more complex, as it requires reliable methods for detecting moving objects [31]. Our dataset at the moment contains only 10 fish species, representing the most observed species in our underwater visual dataset. This, of course, simplifies our fine-grained recognition problem compared to other fine-grained classification benchmarks that contain many more classes, e.g. for plant species [19] or bird recognition [5]. However, the extremely low image quality (the aforementioned fine-grained recognition benchmarks contain only high-resolution images) makes the classification task far from trivial, especially in the case of videos. We are currently working on increasing the number of species up to 100, in order to see how the proposed methods behave in the case of a long-tail data distribution and with fish families sharing common phenotypes, thus making the fine-grained classification task much more complex.