1 Introduction

Wild animal monitoring systems are gaining importance due to the number of Human–Animal Conflicts (HAC) that have occurred over the decades. With visible images, it is tedious to detect animals at night due to the engulfing darkness. Besides, animals can self-mask owing to their flexible structure, and their cluttered backgrounds add to the complexity. Thermal imaging cameras, however, are among the best tools for night vision applications. They work on the principle of heat radiation, so they can easily detect warm-blooded organisms such as humans and animals while ignoring camouflage. Although camera traps are one of the best tools for capturing animals, they do not always capture perfect images and pose several challenges, such as noise, low contrast/illumination, haze/blur, occlusion, camouflage, background clutter, and pose variations, leading to poor interpretation. As a countermeasure, we extract invariant features that are not deterred by these challenges. Invariant features are a special class of features that can identify animals precisely even when they appear slightly different from their original form, making them robust to challenging image conditions.

The number of animal detection models using thermal images is relatively low, the primary reason being the lack of publicly available thermal animal datasets. Yet, thermal images have been used in a few animal studies, such as studying the population of Brazilian free-tailed bats [1, 2], detecting diurnal terrestrial reptiles [3], identifying large mammals in the African bushveld [4], welfare monitoring of rodents [5], and Deer–Vehicle Collision detection systems [6, 7]. Besides, Cilulko et al. [8] reviewed the various uses of thermal images in wildlife studies. A real-time vehicular animal detection system with thermal cameras was proposed by Autoliv [9] and is now used in Audi, BMW, and Daimler cars. With a Fuzzy Inference System (FIS), we can distinguish, for instance, the threat posed by a tiger cub from that of an adult tiger, and this kind of crisp inference is highly essential in real-time applications involving human lives. Although fuzzy logic has been used in a wide variety of applications, such as text recognition [10], moving object detection [11], human detection [12], diagnosing disease in rice [13], detecting estrus in cows [14], pedestrian detection [15, 16], and vehicle detection [17], it has not been studied in animal-based applications due to the complex characteristics of animals. This motivated us to propose the first fuzzy logic-based animal detection and recognition system using thermal images. The contributions of this work include:

  1. Developing an efficient animal detection and recognition system with invariant features and fuzzy logic using thermal images. The model is robust to several challenging image conditions.

  2. Investigating the influence of discrete features and feature fusion in detecting animals, as well as the effects of the distance between the thermal camera and the animal.

  3. Investigating other notable factors, such as weather conditions, required for achieving the best results with thermal images.

The remainder of the paper is organized as follows: Sect. 2 describes the proposed methodology; Sect. 3 discusses the experimental framework, including the results and discussion; and Sect. 4 presents the conclusion and future scope.

2 Proposed Methodology

In this section, we discuss image pre-processing, followed by feature extraction, and finally the fuzzy inference system for recognizing animals in thermal images. The proposed model is entitled “IF-FUZ”, with reference to the invariant features and fuzzy logic it combines. Figure 1 depicts the flow of the proposed model.

Fig. 1

Flow of proposed IF-FUZ

2.1 Image Pre-processing

The raw images captured by the thermal camera are not directly suitable for recognition, so we follow a series of pre-processing steps [18]. First, we use gamma correction to correct the luminance of the image, followed by histogram equalization to enhance the contrast. Next, we apply a gradient-based guided edge-aware smoothing filter; the edge-aware filter preserves the edges of the image and retains the structure of the animal despite the smoothing. Finally, we segment the object of interest by applying a basic thresholding function. The outcome of each pre-processing step is shown in the top portion of Fig. 1.
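As an illustration, the following Python sketch reproduces this chain with OpenCV. The parameter values (gamma, guided-filter radius and eps) and the use of Otsu thresholding are our assumptions for illustration, not the paper's settings; `cv2.ximgproc.guidedFilter` requires the opencv-contrib-python package.

```python
import cv2
import numpy as np

def preprocess(thermal_gray, gamma=0.8, radius=8, eps=100.0):
    """Sketch of the pre-processing chain: gamma correction ->
    histogram equalization -> guided edge-aware smoothing -> thresholding."""
    # 1. Gamma correction to adjust luminance
    norm = thermal_gray.astype(np.float32) / 255.0
    corrected = np.uint8(np.power(norm, gamma) * 255)

    # 2. Histogram equalization to enhance contrast
    equalized = cv2.equalizeHist(corrected)

    # 3. Guided filter smooths while preserving the animal's edges
    smoothed = cv2.ximgproc.guidedFilter(equalized, equalized, radius, eps)

    # 4. Basic (Otsu) thresholding to segment the warm object of interest
    _, mask = cv2.threshold(smoothed, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask
```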

2.2 Feature Extraction

Feature extraction involves extracting the most informative data from the image. Different features have different capabilities, and we consider a fusion of features. The first set of features extracted is the Zernike features, which are invariant to scaling, translation, and rotation. Animals do not pose for cameras and can be in any posture when captured; hence, the invariance property of Zernike features (illustrated in Fig. 2) makes them among the most appropriate features for animals. The next feature captures the skeleton properties of the animal. When animals are deformed or occluded, the skeleton information is most informative, as it is robust to shape noise, deformation, and non-rigid transformations. Illumination is yet another factor affecting animal recognition; in such cases, illumination-invariant features like the Local Binary Pattern (LBP) can be useful. Finally, we also extract the shape of the animal using a fusion of Statistical Model-based Methods (SMM) and the Multi-index Active Model (MAM).

2.2.1 Zernike Features

Zernike moments, which are based on the Zernike polynomials, are orthogonal in nature and hence invariant to rotation. The Zernike basis function for a given ordered pair (n, m) is defined as

$$V_{nm} \left( {\rho ,\theta } \right) = R_{nm} \left( \rho \right)e^{im\theta } , \quad \rho \le 1$$
(1)

where ‘n’ denotes the order of the Zernike polynomial and ‘m’ the multiplicity/repetition of the phase angle. The \(\rho\) and \(\theta\) in the above equation are given by

$$\rho = \sqrt {x^{2} + y^{2} } ;\quad \theta = \arctan \left( {\frac{y}{x}} \right)$$
(2)

where \(\rho\) is the radial vector of the image pixel and \(\theta\) the angle. \(R_{nm}\) in Zernike moments is the Zernike polynomial and is given by

$$R_{nm} \left( \rho \right) = \mathop \sum \limits_{a = 0}^{{\frac{{\left( {n - \left| m \right|} \right)}}{2}}} \left( { - 1} \right)^{a} \frac{{\left( {n - a} \right)!}}{{a!\left( {\frac{n + \left| m \right|}{2} - a} \right)!\left( {\frac{n - \left| m \right|}{2} - a} \right)!}}\rho^{n - 2a} .$$
(3)

For an image function \(f\left( {\rho ,\theta } \right)\) defined over the unit disk, the Zernike moment of order \(n\) with repetition \(m\) is given by

$$A_{nm} = \frac{n + 1}{\pi }\mathop \sum \limits_{\rho = 0}^{1} R_{nm} \left( \rho \right) \mathop \sum \limits_{\theta = 0}^{2\pi } f\left( {\rho ,\theta } \right) e^{ - im\theta } .$$
(4)

To have a complete feature set, we split the image into two sub-regions (left and right halves) and combine the features from the left \(L\left( {x_{n} ,y_{n} } \right)\) and right \(R\left( {x_{n} ,y_{n} } \right)\) into the complete feature vector \(I\left( {x,y} \right)\). We calculate the higher and lower order Zernike moments (see Table 1) using Eqs. 5 and 6.

$${\text{Lower}}\;{\text{order}}\;{\text{ZM}} = \left\{ {Z_{n,m} } \right\}\forall \left\{ {\begin{array}{*{20}l} {2 \le n \le 7} \\ {\left| m \right| \le n} \\ {n - \left| m \right| = 2k} \\ {k \in N}. \\ \end{array} } \right.$$
(5)
$${\text{Higher}}\;{\text{order}}\; {\text{ZM}} = \left\{ {Z_{n,m} } \right\}\forall \left\{ {\begin{array}{*{20}l} {7 \le n \le 12} \\ {\left| m \right| \le n} \\ {n - \left| m \right| = 4k} \\ {k \in N}. \\ \end{array} } \right.$$
(6)
Table 1 Higher and lower order Zernike moments
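As a concrete sketch, the mahotas library computes Zernike moment magnitudes up to a given degree; the split into lower and higher orders below follows the spirit of Eqs. 5 and 6 (the exact \(n - \left| m \right|\) selection rules are not reproduced), and the left/right split mirrors the feature-vector construction above. The radius choice and the even image width are assumptions.

```python
import mahotas
import numpy as np

def zernike_features(mask):
    """Sketch: lower-order (degree <= 7) and higher-order (degree <= 12)
    Zernike magnitudes for the left and right halves of a binary mask,
    concatenated into one feature vector (assumes an even image width)."""
    feats = []
    for half in np.hsplit(mask, 2):              # left/right sub-regions
        r = min(half.shape) // 2                 # assumed disk radius
        feats.append(mahotas.features.zernike_moments(half, r, degree=7))
        feats.append(mahotas.features.zernike_moments(half, r, degree=12))
    return np.concatenate(feats)
```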

2.2.2 Skeleton Feature

The skeleton of an object is the locus of the Centers of Maximal Disks (CMD), i.e., the centers of the inscribed disks that touch the object boundary at more than one point. We represent the skeleton graph in line form (see Fig. 3a), and the points on the skeleton are used to measure distances and to check whether the line segments are CMDs or not. A skeletal point connected to only one other skeletal point is an Ending node (\(E_{\text{node}}\)), and a skeletal point connected to three or more other skeletal points is a Bifurcation node (\(B_{\text{node}}\)). In Fig. 3a, nodes A and B are \(E_{\text{node}}\)s, nodes C, D, and J are \(B_{\text{node}}\)s, and the remaining are normal nodes. Skeleton edges (SE) are formed by connecting the skeletal points between two \(B_{\text{node}}\)s or between an \(E_{\text{node}}\) and a \(B_{\text{node}}\). A primary SE is one with two \(B_{\text{node}}\)s; all other edges are normal edges. Edge CD is a primary SE and edge AC is a normal SE (see Fig. 3b).

Fig. 2

Invariance property of Zernike features

Fig. 3

a Skeleton graph, b skeleton edge

The skeleton processing involves carefully extracting the skeleton, as it is prone to shape noise and deformation. We prune unnecessary SEs that occur due to shape noise. To check whether an SE is important, we compare each normal SE with the original skeleton shape: if the difference is large, the SE is due to shape noise and can be pruned; otherwise, it is retained in the skeleton graph. The obtained skeleton can be further processed by representing it in tree form. The skeletal tree is generated by tracing either the \(B_{\text{node}}\)s or the \(E_{\text{node}}\)s; the nodes and edges of the tree are the \(B_{\text{node}}\)s and the primary SEs, respectively. The topology of a 2D skeleton can differ even for objects of the same shape. To overcome this challenge, we use the skeleton path, which is robust to shape noise. Let \(S\) be the skeleton and \(e\left( S \right) = \left\{ {e_{i} , i = 1,2, \ldots ,M} \right\}\) be the end points of \(S\), where \(M\) denotes the total number of end points. In Fig. 3b, we have 10 end points. The skeleton path \(p_{i,j} = \left( { e_{i } , e_{j} } \right)\) is the shortest path along the skeleton between the end points \(e_{i}\) and \(e_{j}\). Skeleton paths are very informative and robust, as they avoid unnecessary \(B_{\text{node}}\)s through pruning. Moreover, despite deformations and non-rigid transformations, the skeleton path is the same for similarly shaped objects, as it depends on the CMD rather than the topology of the object.
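A minimal sketch of the skeleton-path computation, assuming scikit-image for skeletonization and networkx for shortest paths (the pruning of noisy SEs described above is omitted):

```python
import numpy as np
import networkx as nx
from skimage.morphology import skeletonize

def skeleton_path_lengths(binary_mask):
    """Skeletonize the mask, build an 8-connected pixel graph, locate the
    end points (degree-1 nodes, the E_nodes), and return the sorted
    shortest-path lengths between all end-point pairs."""
    skel = skeletonize(binary_mask > 0)
    pts = set(zip(*np.nonzero(skel)))
    G = nx.Graph()
    for (y, x) in pts:
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy, dx) != (0, 0) and (y + dy, x + dx) in pts:
                    G.add_edge((y, x), (y + dy, x + dx))
    ends = [n for n in G.nodes if G.degree(n) == 1]      # E_nodes
    return sorted(nx.shortest_path_length(G, ends[i], ends[j])
                  for i in range(len(ends)) for j in range(i + 1, len(ends)))
```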

2.2.3 Local Binary Pattern (LBP)

The grayscale rendering of thermal images may suffer from monotonic gray-level changes, known as illumination variation. LBP is a highly discriminative texture descriptor that is invariant to such illumination effects, as illustrated in Fig. 4.

Fig. 4

Illumination invariance of LBP

LBP works by assigning a binary number to every pixel of the image. This binary number, typically called a label, encodes the relation between the center pixel and its neighboring pixels. This can be mathematically represented as

$${\text{LBP}}\left( {x_{c} ,y_{c} } \right) = \mathop \sum \limits_{p = 0}^{P - 1} 2^{p} \times S\left( {i_{p} - i_{c} } \right)$$
(7)

where \(\left( {x_{c} ,y_{c} } \right)\) is the center pixel, \(i_{c}\) its intensity, and \(i_{p}\) the intensity of the \(p\)th of the \(P\) neighboring pixels. \(S\) is the sign function, which equals 1 if \(x \ge 0\) and 0 otherwise. The texture of the image can then be analyzed from the histogram of the labels.
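For illustration, Eq. 7 is readily available in scikit-image; a sketch with assumed neighborhood parameters \(P = 8\) and \(R = 1\):

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, P=8, R=1):
    """Plain LBP codes of Eq. 7 ('default' method), summarized by the
    normalized histogram over the 2**P possible labels."""
    codes = local_binary_pattern(gray, P, R, method="default")
    hist, _ = np.histogram(codes, bins=np.arange(2 ** P + 1), density=True)
    return hist
```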

2.2.4 Shape Feature Extraction

Animals exhibit pose variations due to scaling, orientation, and rotation. For good recognition, the extracted features should be invariant to all of these. We utilize a hybrid of variant shape extraction techniques based on SMM and MAM [19]. This method transforms the edge contour of the image to a specified position, orientation, and scale. The variant methods are normally based on a reference shape, obtained by averaging the shapes of a few defect-free images.

2.2.4.1 SMM-Based Shape Feature Extraction

SMM relies on a reference image for shape feature extraction. The reference image is normalized by moving the object to the center and by adjusting the scale and orientation according to the corresponding standard deviations. The mean of the edge points’ x and y coordinates is taken as the object’s center. The orientation is normalized by rotating the major axis to the horizontal. The original dimension \(\left( {x_{i} ,y_{i} } \right)\) is scaled to the new dimension \(\left( {x_{i}^{'} ,y_{i}^{'} } \right)\) based on the standard deviation \(\left( {{\text{SD}}_{x} ,{\text{SD}}_{y} } \right)\) of the reference shape; the new dimension is calculated by dividing the old dimension by the corresponding standard deviation. Further, we normalize all images in the same way before comparing them with the normalized reference shape.
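A sketch of this normalization under our reading of the above, where `contour` is an N × 2 array of edge points and `ref_sd` the reference standard deviations \(\left( {{\text{SD}}_{x} ,{\text{SD}}_{y} } \right)\):

```python
import numpy as np

def smm_normalize(contour, ref_sd):
    """Center the edge contour, rotate its major axis to the horizontal,
    and scale each axis by the reference shape's standard deviations."""
    pts = contour - contour.mean(axis=0)            # move center to origin
    _, vecs = np.linalg.eigh(np.cov(pts.T))
    angle = np.arctan2(vecs[1, -1], vecs[0, -1])    # major-axis direction
    c, s = np.cos(-angle), np.sin(-angle)
    pts = pts @ np.array([[c, -s], [s, c]]).T       # major axis -> horizontal
    return pts / np.asarray(ref_sd)                 # scale by reference SDs
```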

2.2.4.2 MAM-Based Shape Feature Extraction

The MAM feature extractor is also based on a reference shape, but it is an active method: the position, orientation, and scale of the object are adjusted so that it approximates the reference shape. The shape transformation is illustrated with a lion in Fig. 5. The outer layer is the actual edge contour of the original image, the innermost layer is the normalized edge contour, and the shape near the normalized edge contour is the reference shape of the object. The line passing through the geometrical center (origin), the tail, and the forehead of the lion is taken as the x-axis. Starting from the forehead, a number of equiangular positions are taken along the contour. The figure illustrates an arbitrary position \(\left( {\theta_{k} } \right)\) and its corresponding radius \(\left( {R_{k} } \right)\). To compare the reference shape and the normalized object contour, we use several shape indices defined on the relative differences of radius, continuity, and curvature. The radius index is the distance between the origin and the edge of the object contour: \(I_{1,k} = R_{k}\). The continuity index measures the difference between the radii of two adjacent equiangular positions: \(I_{2,k} = R_{k + 1} - R_{k}\). The curvature index is a second-order finite difference of the object contour at a given equiangular position: \(I_{3,k} = R_{k - 1} - 2R_{k} + R_{k + 1}\). The area index is the surface area of the projected object, denoted \(I_{4}\). Finally, the aspect ratio, denoted \(I_{5}\), is the ratio of the maximum breadth to the maximum length. These five shape indices are calculated for both the reference shape \({\text{IR}}_{i,k}\) and the object under consideration \(I_{i,k}\).

Fig. 5

Illustration of shape transformation
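Under the definitions above, the per-position indices follow directly from the equiangular radii \(R_{k}\); the area and aspect-ratio computations below are simplified assumptions:

```python
import numpy as np

def mam_indices(radii):
    """Shape indices from the equiangular radii R_k (Sect. 2.2.4.2),
    assuming wrap-around (np.roll) on the closed contour."""
    R = np.asarray(radii, dtype=float)
    i1 = R                                          # radius index     I1,k
    i2 = np.roll(R, -1) - R                         # continuity index I2,k
    i3 = np.roll(R, 1) - 2 * R + np.roll(R, -1)     # curvature index  I3,k
    theta = np.arange(len(R)) * 2 * np.pi / len(R)
    x, y = R * np.cos(theta), R * np.sin(theta)
    i4 = 0.5 * abs(np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y))  # shoelace area
    i5 = (y.max() - y.min()) / (x.max() - x.min())  # breadth / length
    return i1, i2, i3, i4, i5
```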

Animals have inherent variations at each equiangular position. To account for this, we consider the difference between \(\left( {I_{i,k} } \right)\) and \(\left( {{\text{IR}}_{i,k} } \right)\) relative to the standard deviation of the reference shape \(\left( {{\text{SDR}}_{i,k} } \right)\), defined as

$$D_{{i,k}} = \left\{ {\begin{array}{*{20}l} {\frac{{\left| {I_{{i,k}} - {\text{IR}}_{{i,k}} } \right|}}{{{\text{SDR}}_{{i,k}} }};} & {D_{{i,k}} > T} \\ {0;} & {D_{{i,k}} \le T} \\ \end{array} } \right.$$
(8)

where \(i\) denotes the \(i{\text{th}}\) shape feature, \(k\) the \(k{\text{th}}\) equiangular position, and \(T\) the threshold. In the experiments, we set \(T\) to 1, so as to allow a tolerance of one standard deviation around the reference shape contour. This excludes minor object contour variations that arise due to inhomogeneity.
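Equation 8 then reduces to a few lines; a sketch with \(T = 1\) as in the experiments:

```python
import numpy as np

def shape_deviation(I, IR, SDR, T=1.0):
    """Eq. 8: deviation of the object's indices I from the reference IR in
    units of the reference standard deviation SDR; deviations within T
    standard deviations are zeroed out."""
    D = np.abs(np.asarray(I) - np.asarray(IR)) / np.asarray(SDR)
    D[D <= T] = 0.0
    return D
```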

2.3 Fuzzy Inference System (FIS) for Animal Detection and Recognition

Fuzzy logic provides the flexibility with which human reasoning capability can be effectually applied to an artificial knowledge system. The FIS has three major components, namely fuzzification, rule base, and defuzzification.

2.3.1 Fuzzification

In this step, we choose the inputs and their appropriate membership functions (MF), each a mapping between an input \(\left( x \right)\) and its membership value \(f\left( x \right)\). IF-FUZ has five inputs, namely the higher and lower order Zernike features, the length of the skeleton path, the texture, and the shape feature. Higher order Zernike features are more accurate than lower order ones; however, they are prone to noise. Hence, we chose optimal values for both the higher (12th) and lower (4th) orders. Of these five inputs, the skeleton path and shape feature have triangular MFs and the remaining have trapezoidal MFs. Both output variables use the Gaussian MF, since it is non-zero and symmetric at all points. We have defined two output variables, namely the animal class and its category. Figures 6 and 7 depict the MFs of the input and output variables, respectively.

Fig. 6

Input membership function. a Zernike higher order, b texture, c skeleton path, d shape feature

Fig. 7

Output membership function of animal class and their category

Each of the inputs has three fuzzy sets. The first output variable (animal class) has 12 fuzzy sets representing the different animals, and the second (category) has three fuzzy sets, namely domestic, wild, and others.

2.3.2 Fuzzy Rule Base

A fuzzy rule base is a set of verbal descriptions based on “if… then” rules, and the fuzzy system arrives at a conclusion based on the rules we have defined. IF-FUZ has 48 rules, as listed in Table 2. The rules were formulated from the values obtained by the feature extraction techniques: the extracted features were normalized and fixed to ranges, and after a few rounds of trial and error, the range for each feature was properly defined.
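To make the inference pipeline concrete, here is a minimal scikit-fuzzy sketch of the FIS of Sects. 2.3.1–2.3.3. The universes, fuzzy-set boundaries, and the single rule are hypothetical placeholders rather than the actual 48 rules or ranges of Table 2; note that scikit-fuzzy defuzzifies by the centroid method by default, matching the COD choice described next.

```python
import numpy as np
import skfuzzy as fuzz
from skfuzzy import control as ctrl

u = np.linspace(0, 1, 101)          # normalized feature universe (assumed)

# Five inputs: trapezoidal MFs for the Zernike orders and texture,
# triangular MFs for skeleton path and shape (Sect. 2.3.1)
zern_hi = ctrl.Antecedent(u, 'zernike_high')
zern_lo = ctrl.Antecedent(u, 'zernike_low')
texture = ctrl.Antecedent(u, 'texture')
skel = ctrl.Antecedent(u, 'skeleton_path')
shape = ctrl.Antecedent(u, 'shape')
for v in (zern_hi, zern_lo, texture):
    v['low'] = fuzz.trapmf(u, [0, 0, 0.2, 0.4])
    v['medium'] = fuzz.trapmf(u, [0.3, 0.45, 0.55, 0.7])
    v['high'] = fuzz.trapmf(u, [0.6, 0.8, 1, 1])
for v in (skel, shape):
    v['low'] = fuzz.trimf(u, [0, 0, 0.4])
    v['medium'] = fuzz.trimf(u, [0.3, 0.5, 0.7])
    v['high'] = fuzz.trimf(u, [0.6, 1, 1])

# Gaussian output MFs for the animal class (one of the 12 sets shown)
animal = ctrl.Consequent(np.linspace(0, 12, 121), 'animal_class')
animal['elephant'] = fuzz.gaussmf(animal.universe, 1.0, 0.3)

# One illustrative "if ... then" rule in the style of Table 2
rule = ctrl.Rule(zern_hi['high'] & shape['high'] & texture['medium'],
                 animal['elephant'])
sim = ctrl.ControlSystemSimulation(ctrl.ControlSystem([rule]))
sim.input['zernike_high'], sim.input['shape'], sim.input['texture'] = 0.85, 0.9, 0.5
sim.compute()
print(sim.output['animal_class'])   # centroid-defuzzified class value
```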

Table 2 Fuzzy rules for animal detection and recognition

2.3.3 Defuzzification

Defuzzification quantifies the fuzzy information into a numerical value. We settled on centroid defuzzification (COD) after experimenting with various other techniques. A sample result is depicted in Fig. 8.

Fig. 8

Defuzzification of elephant with the category as wild animal

3 Experimental Results and Discussion

A part of the study was conducted in Vellore (12.9165° N, 79.1325° E), Tamil Nadu. Photos were taken throughout the day to analyze the behavior of thermal images at various times of day. A few animals such as fox, monkey, and deer were captured at Amirthi Zoological Park, Vellore. The dataset of 1862 thermal images has 12 animal classes grouped into three categories, namely domestic, wild, and others. Under the wild category, animals such as lion, tiger, cheetah, and elephant were taken from the web; all other animals were captured manually with a thermal camera. We used a Forward Looking InfraRed (FLIR) E40 thermal camera [20]. The FLIR E40 operates in the Long-Wave Infrared (LWIR) band and has an IR resolution of 160 × 120, a thermal sensitivity of 0.07 °C, and a temperature range of −20 °C to 650 °C.

Sample images from the dataset and the challenging image conditions from camera traps are shown in Figs. 9 and 10, respectively.

Fig. 9

Sample images from thermal dataset

Fig. 10

Challenging image conditions

3.1 Baseline Results

For the 1862 thermal images, a total of 46,550 detection and recognition attempts were made, with each of 25 volunteers assessing all 1862 images. The 25 volunteers comprised 5 domain experts, 10 experienced volunteers, and 10 novices. The detection attempts of the domain experts were set as the gold standard and used to cross-verify the results of all other volunteers. Overall, 66% of the attempts were correct; however, there were substantial differences between animal species: 94% of attempts were correct for elephants, whereas only 37% were correct for monkeys.

3.2 Performance Analysis of IF-FUZ

IF-FUZ is meticulously assessed with standard performance measures, namely accuracy, precision, and recall rate [21]. As the dataset is unbalanced, we also consider the F-measure, the weighted average of precision and recall. The performance of IF-FUZ on our thermal dataset is presented in Fig. 11.

Fig. 11

Performance analysis of IF-FUZ on thermal dataset

IF-FUZ achieved an average accuracy of 97%, which is on par with the gold standard produced by the domain experts. In terms of precision, IF-FUZ had a positive predictive value of 91.4%, implying that the model recognized the animals correctly in most cases. The recall rate of the model is around 98%, and the weighted F-measure was 98.4%, signifying good precision and recall.

IF-FUZ uses several feature descriptors, each with a certain invariance property. To understand the importance of each feature, we first tested the individual feature descriptors on the fuzzy model and then tested their fusion. The results are presented in Table 3. It can be inferred that the LBP and skeleton features alone are not sufficient for recognizing animals, whereas the Zernike and shape features achieved considerable performance and proved to be the foremost features necessary for accurate recognition. Nevertheless, the fusion of all four feature descriptors has the highest accuracy of 97%, implying the necessity of all four feature sets. The best results are highlighted in boldface in Tables 3, 4, and 5.

Table 3 Influence of discrete features and their fusion

One of the common issues with camera trap images is animals being captured too far from or too close to the camera. To study this issue, we deliberately captured the animals at varying distances in the range of 5 to 50 m. The effect of distance from the thermal camera is illustrated in Fig. 12.

Fig. 12

Effects of distance in recognizing animals

The number of incorrect results was quite high for monkeys, as they were mostly captured within dense vegetation. Hence, distance alone does not affect the recognition rate; rather, distance coupled with vegetation does. In yet another experiment, animal detection from challenging images was studied with both the volunteers and IF-FUZ (see Fig. 13).

Fig. 13

Performance comparison in detecting animals from challenging images

IF-FUZ detected animals accurately in almost all the challenging images, except the self-occluded and partially captured ones, whereas the volunteers found it difficult to detect animals in images with illumination and resolution problems.

3.3 Comparative Study

For the baseline comparison, we compared IF-FUZ with other fuzzy models on our thermal dataset (see Table 4). Among these, the neuro-fuzzy models [15, 16] achieved substantial performance compared to the other systems. However, with carefully chosen invariant features, IF-FUZ achieved the best results.

Table 4 Comparison with existing fuzzy models

Camera trap images exhibit several challenging conditions that make animal detection a difficult task. Here, we consider the 10 most common challenges in camera traps and identify the best feature for each. We also compared the performance of the animal–vehicle detection systems [6, 7] with IF-FUZ on the thermal dataset; the result is presented in Table 5.

Table 5 Accuracy of discrete features in detecting animals from challenging images

As inferred from the table, for challenges that arise due to lighting (illumination problems, over-exposure, or low-resolution images), the illumination-invariant LBP feature was the most appropriate for detecting the animal. Nevertheless, the fusion of features gives higher confidence in the detection. A sample image for each challenging condition in Table 5 is shown in Fig. 10. We also studied how effective animal detection is with thermal versus RGB images; as part of this study, we compared IF-FUZ with the RGB-based animal detection systems [18, 22], and the result is depicted in Fig. 14.

Fig. 14

Animal detection in RGB vs. Thermal images

As inferred from the figure, animal detection was much easier and more accurate with thermal images than with RGB, even though the contrast between the animal and the background in the thermal images is poor. This signifies that thermal images are more appropriate for capturing animals than RGB. Among the compared methods, IF-FUZ produced accurate results, whereas the compared systems produced false positives and false negatives due to the choice of image modality.

3.4 Discussion

The results from the experiments validate the suitability of thermal images for animal-based applications. Animals are different from humans, and the model should be designed to cater to their characteristics; evidently, invariant features chosen to adapt to these characteristics detect and recognize animals effectively even under challenging image conditions. With thermal images, we can ignore the camouflage caused by dense vegetation. Although the model had good precision, detection was quite difficult during the daytime, as the surroundings heat up under soaring temperatures and it becomes difficult to differentiate the foreground from the background in certain images. For such images, the pre-processing had little effect, and so we conclude that thermal images are most useful during pre-dawn, dusk, and night time. It is also worth noting that resolution decreases with a broader field of view (FOV). The proposed model is complementary to other animal warning systems, and IF-FUZ can be used as a non-audible alert for forest officials to take necessary actions. Fuzzy constraints [23] and neuro-fuzzy techniques [24] can be combined to build more intelligent models. Similarly, thermal and visible images [25] can be fused to obtain complementary information from both modalities and overcome the drawbacks of each. Besides, thermal images can be employed in aerial animal detection applications [26] for monitoring animals over vast areas.

4 Conclusion and Future Scope

In this work, we proposed a robust animal detection and recognition system using fuzzy logic and thermal images. The proposed model, entitled IF-FUZ, encompasses the most appropriate invariant features without overloading the fuzzy inference system. The experimental results show that the invariant features used in the fuzzy system are efficient in detecting animals even under the most challenging image conditions. Under most circumstances, the detection rate was close to 100% and the recognition rate was around 98%. Besides, the proposed model holds good for other animal-oriented applications. As future work, the model should be fine-tuned to detect animals in RGB images as well, as they can be helpful during the daytime; alternatively, the fusion of visible and thermal images can be studied for detecting animals. In addition, the model should be tested at various altitudes and under various vegetation types to better understand the effects of thermal imaging.