1 Introduction

Visual recognition has posed a significant challenge to the computer vision research community due to intra-class variability, e.g., in illumination, pose, and occlusion. The rich context of an image makes semantic understanding (object and pattern recognition) very difficult. Although the community has spent a lot of effort on image categorization and classification [1–4], which has led to very powerful intermediate image representations such as the bag-of-features (BoF) model [5, 6], there is still room to improve recognition performance in specific applications such as texture, food image, and biomedical data recognition. The main idea of the popular BoF model is to quantize local invariant descriptors, for example, extracted at interest points detected by [7, 8] and described with SIFT [9], into a set of visual words [6]. The frequency vector of the visual words then represents the image, and an inverted file system is used for efficient comparison of such BoF representations. Therein, the local descriptors for microstructure representation are usually left untouched, relying on standard, expert-designed ones such as SIFT [9], PCA-SIFT [10], and HOG [11], which result in similar recognition performance. Further, since the codebook learned in the BoF model has high diversity and irregularity, it is difficult to integrate co-occurrence context information into the image representation.
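To make the BoF pipeline mentioned above concrete, the following minimal Python sketch (illustrative only; the codebook size K, the clustering choice, and the function names are our assumptions rather than the setup of the cited works) quantizes local descriptors into visual words and pools them into a frequency histogram:

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptor_sets, K=256, seed=0):
    # Learn K visual words by clustering local descriptors pooled from training images.
    all_desc = np.vstack(descriptor_sets)              # (N, D) stack of local descriptors
    return KMeans(n_clusters=K, random_state=seed, n_init=10).fit(all_desc)

def bof_histogram(descriptors, codebook):
    # Quantize one image's descriptors to their nearest visual words and count frequencies.
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                  # L1-normalized frequency vector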

On the other hand, microstructures using pixel neighborhoods as small as 3×3 pixels for local pattern representation have been proven to be powerful for discriminating texture information, and the most popular such descriptor is the local binary pattern (LBP). LBP [12–14] characterizes each 3×3 local patch (microstructure) as a binary series by comparing each surrounding pixel intensity with the center one: the bit of a surrounding pixel is set to 1 if its intensity is larger than the center one, and 0 otherwise. LBP is robust against uniform illumination changes and is also easy to extend owing to the regularly decimated levels for all neighborhood pixels. However, due to the lack of spatial relationships among local textures, there are still serious disadvantages in the original LBP representation. Therefore, an extension of LBP called CoLBP [15, 16] has been proposed that considers the co-occurrence (spatial context) among adjacent LBPs, and it has shown promising performance on several classification applications [6]. In addition, by integrating orientation context, Nosaka et al. [17, 18] explored a rotation invariant co-occurrence among LBPs, which was shown to be more discriminative for image classification than CoLBP. Qi et al. [19] proposed a pairwise rotation invariant CoLBP for texture representation and achieved promising recognition performance in several visual recognition tasks.

Although LBP-based descriptors have shown promising performance compared to other feature representations, LBP only thresholds the differential values between the neighborhood pixels and the focused one to 0 or 1, which makes it very sensitive to noise in the processed image. Tan et al. extended LBP to the local ternary pattern (LTP) [20], which treats the differential value between a neighborhood pixel and the focused one as no stimulus or a negative/positive stimulus, and successfully applied it to face recognition under difficult lighting conditions. Given a pre-set positive threshold η, LTP [20] obtains a series of ternary values for local pattern representation. However, regardless of the magnitude of the focused pixel, the pre-set threshold η remains fixed, which violates the principle of human perception. Therefore, following the fact that human perception of a pattern depends not only on the absolute intensity of the stimulus but also on the relative variation of the stimuli, we propose to quantize the ratio between the neighborhood and the center pixels, which is equivalent to adaptively (in a data-driven way) deciding the quantization point according to the magnitude of the focused pixel. This quantization strategy is inspired by Weber’s law, a psychological law [21], which states that the noticeable change of a stimulus such as sound or lighting perceived by a human being is a constant ratio of the original stimulus; when the stimulus has a small magnitude, a small change is already noticeable. Thus, we propose a data-driven quantized local ternary pattern (WLTP) based on Weber’s law, which determines the activation status of each neighborhood pixel by thresholding the noticeable change relative to the stimulus of the focused pixel: positively activated (magnitude 1) if the ratio between the stimulus change and the focused stimulus is larger than a constant η, negatively activated (magnitude −1) if the change ratio is smaller than −η, and not activated (magnitude 0) otherwise. By incorporating co-occurrence (i.e., spatial and orientation) context, we extend the proposed WLTP to a rotation invariant co-occurrence WLTP (RICWLTP), which is robust to image rotation and has high descriptive ability owing to the co-occurrences among WLTPs. Compared with state-of-the-art methods, our proposed strategy achieves much better performance on several visual recognition tasks, including two texture datasets and one food image dataset.

2 Related work

In computer vision, local descriptors (i.e., features computed over limited spatial support) have proven well suited for matching and recognition tasks, as they are robust to partial visibility and clutter. The currently most popular local descriptor is the SIFT feature proposed in [9]. With the local SIFT descriptor, there are usually two types of algorithms for object recognition. One is to match local points with SIFT features between two images, and the other is to use the popular bag-of-features (BoF) model, which forms a frequency histogram of predefined visual words over all sampled region features [6]. However, the extraction of the SIFT descriptor itself is time-consuming, and matching a large number of descriptors, or quantizing them into a histogram over a large number of visual words as generally needed in the BoF model for acceptable recognition, also takes a lot of time. On the other hand, some research works [22, 23] have shown that it is possible to discriminate between texture patterns using pixel neighborhoods as small as a 3×3 pixel region, demonstrating that, despite the global structure of the images, very good discrimination can be achieved by exploiting the distributions of such pixel neighborhoods. Therefore, exploiting such microstructures to represent images by distributions of local descriptors has gained much attention and has led to state-of-the-art performance [24–28] on different classification problems in computer vision. The most basic of these approaches is the local binary pattern (LBP), introduced by Ojala et al. [22] as a means of summarizing local gray-level structure. As noted above, LBP is a simple yet efficient texture operator that labels the pixels of an image by thresholding the neighborhood of each pixel against the value of the central pixel; the result is a binary digit associated with each neighborhood pixel. For image representation, the histogram of LBP indexes is generally calculated as the feature. Unfortunately, because everything is packed into a single histogram, the spatial relations among the LBPs are mostly discarded, which results in a loss of global image information. Motivated by the co-occurrence concept used in Co-HOG [29–31] and joint Haar-like features [30], Nosaka et al. [17, 18] proposed to integrate context information (i.e., spatial and orientation) into the conventional LBP to achieve high descriptive capability in image representation; a similar extension of LBP integrating context information can also be seen in the recent work [19].
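As a concrete illustration of the operator described above, the sketch below (assuming an 8-neighbor, radius-1 configuration on a grayscale image stored as a numpy array; not the code of [22]) computes the LBP index map and the packed 256-bin histogram used as the image feature:

import numpy as np

def lbp_histogram(img):
    # 8-neighbor, radius-1 LBP: each neighbor is thresholded against the center pixel
    # and the 8 bits are packed into an index in [0, 255]; the 256-bin histogram is the feature.
    img = np.asarray(img, dtype=np.int32)
    H, W = img.shape
    center = img[1:H-1, 1:W-1]
    # clockwise neighbor displacements (dy, dx) starting from the right-hand neighbor
    neighbors = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]
    code = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(neighbors):
        shifted = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        code += (shifted > center).astype(np.int32) << bit   # bit set if neighbor > center
    return np.bincount(code.ravel(), minlength=256).astype(float)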

However, the LBP-based descriptors only set the differential values between neighborhood pixels and the focused pixel to zero or one, and thus they are highly sensitive to noise in the processed image, which in turn degrades the discriminative power of the image representation. Thus, as noted above, Tan et al. extended LBP to the local ternary pattern (LTP) [20], which treats the differential value between a neighborhood pixel and the focused pixel as either a negative/positive stimulus or no stimulus at all. The series of ternary values is then combined into an LTP index. Given the intensities \([I(\mathbf{x}), I(\mathbf{x}+\triangle\mathbf{x}_{0}),\cdots,I(\mathbf{x}+\triangle\mathbf{x}_{l}),\cdots,I(\mathbf{x}+\triangle\mathbf{x}_{L-1})]\) of a focused pixel \(\mathbf{x}\) and its L neighbors \(\mathbf{x}+\triangle\mathbf{x}_{l}\) (with displacement vectors \(\triangle\mathbf{x}_{l}\)), LTP thresholds the differential values \([I(\mathbf{x}+\triangle\mathbf{x}_{0})-I(\mathbf{x}),\cdots,I(\mathbf{x}+\triangle\mathbf{x}_{l})-I(\mathbf{x}),\cdots,I(\mathbf{x}+\triangle\mathbf{x}_{L-1})-I(\mathbf{x})]\) as

$$ G(I(\mathbf{x}+\triangle\mathbf{x}_{l}) - I(\mathbf{x}))=\left\{ \begin{array}{ccc} 1 & I(\mathbf{x}+\triangle\mathbf{x}_{l}) - I(\mathbf{x})> \eta \\ -1 & I(\mathbf{x}+\triangle\mathbf{x}_{l}) - I(\mathbf{x})<- \eta\\ 0 & \text{otherwise} \end{array}\right. $$
(1)

where η is the pre-set constant for thresholding the differential values. Then, the LTP index at x is defined as

$$ LTP(\mathbf{x})=\sum\limits_{l=0}^{L-1} \left[G(I(\mathbf{x}+\triangle\mathbf{x}_{l}) - I(\mathbf{x}))+1\right]3^{l} $$
(2)

In the above equation, regardless of the magnitude of the focused pixel, the pre-set threshold η in LTP remains fixed, which violates the principle of human perception.
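A minimal sketch of Eqs. (1) and (2) for a single pixel is given below (illustrative Python; the choice of L = 4 axis-aligned neighbors at unit radius and the function name are ours, anticipating the configuration used in Section 3):

def ltp_index(img, y, x, eta=5, offsets=((0, 1), (1, 0), (0, -1), (-1, 0))):
    # LTP index at pixel (y, x): each neighbor difference is thresholded by the fixed
    # constant eta into {-1, 0, 1} (Eq. 1) and accumulated as a base-3 digit (Eq. 2).
    center = int(img[y, x])
    index = 0
    for l, (dy, dx) in enumerate(offsets):              # L = len(offsets) neighbors
        diff = int(img[y + dy, x + dx]) - center
        g = 1 if diff > eta else (-1 if diff < -eta else 0)
        index += (g + 1) * 3 ** l                        # ternary digit (g + 1) in {0, 1, 2}
    return index                                         # in [0, 3**L - 1]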

3 Data-driven quantized LTP and co-occurrence context

3.1 Weber’s law

Ernst Heinrich Weber, an experimental psychologist in the nineteenth century, approached the study of the human response to a physical stimulus in a quantitative fashion and observed that the ratio of the increment threshold to the background intensity is a constant [32]. This observation shows that the just noticeable difference (JND) between two stimuli is proportional to the magnitude of the stimuli, which is well known as Weber’s law and can be formulated as:

$$ \frac {\triangle I} {I}=a $$
(3)

where △I denotes the increment threshold (the just noticeable difference for discrimination) and I denotes the initial stimulus intensity; a is known as the Weber fraction, which indicates that the proportion on the left-hand side of the equation remains constant despite variations in I. Simply speaking, Weber’s law states that the size of the just noticeable difference is a constant proportion (a times) of the original stimulus value, i.e., the minimum amount by which the stimulus intensity must change in order to produce a noticeable variation in sensory experience. For example, with a = 0.05, an increment of 5 is just noticeable against a background intensity of 100, whereas an increment of only 1 suffices against a background intensity of 20.

3.2 Weber-based LTP: WLTP

According to Weber’s law, the JND of a focused pixel in relation to its neighboring pixels is proportional to the intensity I(x) of the focused pixel. Thus, we quantize the differential values \([I(\mathbf{x}+\triangle\mathbf{x}_{0})-I(\mathbf{x}),\cdots,I(\mathbf{x}+\triangle\mathbf{x}_{l})-I(\mathbf{x}),\cdots,I(\mathbf{x}+\triangle\mathbf{x}_{L-1})-I(\mathbf{x})]\) between a focused pixel \(\mathbf{x}\) and its L neighbors \(\{\mathbf{x}+\triangle\mathbf{x}_{l}\}\ (l=0,1,\cdots,L-1)\) to form a ternary series as follows:

$$ G\left(\frac{I(\mathbf{x}+\triangle\mathbf{x}_{l}) - I(\mathbf{x})}{I(\mathbf{x})+\alpha}\right)=\left\{ \begin{array}{ccc} 1 & \frac{I(\mathbf{x}+\triangle\mathbf{x}_{l}) - I(\mathbf{x})}{I(\mathbf{x})+\alpha}> \eta \\ -1 & \frac{I(\mathbf{x}+\triangle\mathbf{x}_{l}) - I(\mathbf{x})}{I(\mathbf{x})+\alpha}< -\eta\\ 0 & \text{otherwise} \end{array}\right. $$
(4)

where η is a predefined constant and α is a constant that avoids the case of zero intensity (i.e., no stimulus); we always set α to one in our experiments. Equation (4) adaptively quantizes the differential values between the focused pixel and its neighboring pixels into a series of ternary codes; the WLTP index at \(\mathbf{x}\) is then defined as

$$ \text{WLTP}(\mathbf{x})=\sum\limits_{l=0}^{L-1} \left[G\left(\frac{I\left(\mathbf{x}+\triangle\mathbf{x}_{l}\right) - I(\mathbf{x})}{I(\mathbf{x})+\alpha}\right)+1\right]3^{l} $$
(5)

In the general LBP, the neighborhood pixel number L is usually set to 8. Because of the finer quantization (i.e., the ternary representation instead of the binary one), L is set to 4 here to reduce the computational cost. The lth displacement vector \(\triangle\mathbf{x}_{l}\) is formulated as \(\triangle\mathbf{x}_{l}=(r\cos(\theta_{l}), r\sin(\theta_{l}))\), where \(\theta _{l}=\frac {360^{\circ}}{L}l\) and r is the scale parameter (i.e., the distance from the neighboring pixels to the focused pixel) of WLTP. As a result, as shown in Fig. 1, WLTPs have \(N_{P}=81\) (\(=3^{L}\)) possible patterns.
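The WLTP extraction of Eqs. (4) and (5) can be sketched as follows (a vectorized illustration assuming r = 1, so that the four displacements are axis-aligned, and α = 1 as in our experiments; not an exact reproduction of our implementation):

import numpy as np

def wltp_map(img, eta=0.05, alpha=1.0):
    # WLTP index map (Eqs. 4-5) with L = 4 neighbors at r = 1.  The ratio
    # (neighbor - center) / (center + alpha) is thresholded by eta into {-1, 0, 1},
    # giving indexes in [0, 80] (3^4 possible patterns).
    img = np.asarray(img, dtype=np.float64)
    H, W = img.shape
    center = img[1:H-1, 1:W-1]
    offsets = [(0, 1), (1, 0), (0, -1), (-1, 0)]         # theta_l = 0, 90, 180, 270 degrees
    index = np.zeros(center.shape, dtype=np.int64)
    for l, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        ratio = (neigh - center) / (center + alpha)
        g = np.where(ratio > eta, 1, np.where(ratio < -eta, -1, 0))
        index += (g + 1) * 3 ** l
    return index                                         # (H-2, W-2) map of WLTP indexes

def wltp_histogram(img, eta=0.05, alpha=1.0):
    # 81-bin histogram of WLTP indexes, usable directly as a simple image feature.
    return np.bincount(wltp_map(img, eta, alpha).ravel(), minlength=81).astype(float)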

Fig. 1 The procedure of the proposed WLTP extraction

3.3 Integration of spatial and orientation contexts

LTP and WLTP discard information about the spatial relationships between adjacent patterns; such information is crucial for describing the texture of images. In this study, we first integrate the spatial context between adjacent LTPs and WLTPs via co-occurrence information (called CoLTP and CoWLTP, respectively). Without loss of generality, we describe the context-integration strategy for the proposed WLTP; the same procedure is also feasible by replacing WLTP with LTP. To obtain statistics of the co-occurrence (i.e., the spatial context) between two adjacent WLTPs, we consider the \(N_{P}\times N_{P}\) auto-correlation matrix defined as:

$$\begin{array}{@{}rcl@{}} H_{p,q}^{\varphi}&=&\sum\limits_{\mathbf{x}\subset \mathbf{I}} \delta_{p,q}\left(\text{WLTP}(\mathbf{x}), \text{WLTP}(\mathbf{x}+\triangle\mathbf{x}_{\varphi})\right)\\ \delta_{p,q}(z_{1},z_{2}) &=& \left\{ \begin{array}{ll} 1 &\textrm{if \(z_{1}=p\) and \(z_{2}=q\)}\\ 0 & \text{otherwise} \end{array} \right. \end{array} $$
(6)

where \(p, q\ (=0,1,\cdots,N_{P}-1)\) are the possible pattern indexes of the two adjacent WLTPs and φ is the angle that determines the positional relation between the two WLTPs, defining the displacement vector \(\triangle\mathbf{x}_{\varphi}=(d\cos\varphi, d\sin\varphi)\) with interval d. The dimension of the co-occurrence matrix is then 6561 (\(=N_{P}\times N_{P}\)).
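Given a WLTP index map, the co-occurrence statistics of Eq. (6) for one direction φ can be accumulated as in the sketch below (building on the wltp_map helper assumed above; the interval d and angle φ are parameters):

import numpy as np

def cowltp_feature(wltp, d=2, phi_deg=0, n_patterns=81):
    # Co-occurrence matrix H^phi of Eq. (6): count index pairs (WLTP(x), WLTP(x + dx_phi))
    # with displacement dx_phi = (d cos(phi), d sin(phi)), then flatten to 81 * 81 = 6561 dims.
    phi = np.deg2rad(phi_deg)
    dx, dy = int(round(d * np.cos(phi))), int(round(d * np.sin(phi)))
    H, W = wltp.shape
    y0, y1 = max(0, -dy), min(H, H - dy)                 # region where both pixels are valid
    x0, x1 = max(0, -dx), min(W, W - dx)
    p = wltp[y0:y1, x0:x1]
    q = wltp[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
    cooc = np.zeros((n_patterns, n_patterns), dtype=np.float64)
    np.add.at(cooc, (p.ravel(), q.ravel()), 1)           # accumulate pair counts
    return cooc.ravel()                                  # 6561-dimensional vector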

Further, because of possibly different imaging viewpoints, rotation invariance (i.e., orientation context) is generally an indispensable characteristic for texture image representation. Thus, we also integrate the orientation context among adjacent WLTPs and propose rotation invariant co-occurrence WLTPs, which contribute much higher descriptive capability for image representation. We first denote two pairs of WLTP patterns, \(P_{\varphi =0}^{\text {WLTP}}=[\text {WLTP}(\mathbf {x}), \text {WLTP}(\mathbf {x}+\triangle \mathbf {x}_{\varphi =0})]\) and \(P_{\varphi }^{\text {WLTP}}=[\text {WLTP}^{\varphi }(\mathbf {x}), \ \text {WLTP}^{\varphi }(\mathbf {x}+\triangle \mathbf {x}_{\varphi })]\), where WLTP(x) gives the four clockwise ternary digits with the first digit in the right-horizontal direction (φ=0) and \(\text{WLTP}(\mathbf{x}+\triangle\mathbf{x}_{\varphi=0})\) is the co-occurring WLTP in the φ=0 direction; \(\text{WLTP}^{\varphi}(\mathbf{x})\) and \(\text{WLTP}^{\varphi}(\mathbf{x}+\triangle\mathbf{x}_{\varphi})\) indicate the entire WLTP pair rotated by angle φ. Rotation invariant statistics can thus be formulated if we assign the same index to \(P_{\varphi }^{\text{WLTP}}\) regardless of the rotation designated by φ. Because we only use four neighbors of the focused pixel, only four rotation angles (i.e., φ=0°, 90°, 180°, 270°) are available for computing rotation invariant statistics, as shown in Fig. 2 (i.e., four equivalent WLTP pairs). With the labels assigned to rotation invariant WLTP pairs, the number of valid co-occurrence patterns is reduced from 6561 to 2222. To efficiently calculate the rotation invariant co-occurrence of WLTPs, we use a mapping table M generated by the algorithm shown in Table 1, which converts a WLTP pair to its rotation-equivalent index. In Table 1, shift((p)3, l) means circularly shifting the ternary digits of p by l positions. With the calculated mapping table, the statistics of the rotation invariant co-occurrence WLTP can be formulated as

$$\begin{array}{@{}rcl@{}} H^{RI}_{M(\text{index})} &=& \sum\limits_{\mathbf{x}\subset \mathbf{I}} \bigcup\limits_{\varphi} \delta^{\text{index}}\left[M\left(\text{WLTP}^{\varphi}(\mathbf{x}), \text{WLTP}^{\varphi}(\mathbf{x}+\triangle\mathbf{x}_{\varphi})\right)\right]\\ \delta^{\text{index}}(z) &=& \left\{ \begin{array}{ll} 1 &\textrm{if \(z=\text{index}\)}\\ 0 & \text{otherwise} \end{array} \right. \end{array} $$
(7)
Fig. 2 The four equivalent WLTP pairs

Table 1 Calculation of mapping table between the WLTP pair and the rotation equivalent index

where \(M(\text{WLTP}^{\varphi}(\mathbf{x}), \text{WLTP}^{\varphi}(\mathbf{x}+\triangle\mathbf{x}_{\varphi}))\) is the index of the rotation invariant co-occurrence of a WLTP pair (RICWLTP). Finally, the statistics (i.e., histogram) of the RICWLTP can be used as a discriminative representation of images. Similarly, if we formulate Eqs. (6) and (7) based on LTP instead of WLTP, the index of the rotation invariant co-occurrence of an LTP pair (RICLTP) is obtained, and the statistics (i.e., histogram) of the RICLTP can likewise be used for image representation.
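The exact construction of the mapping table M follows Table 1; as a rough illustration of the idea, the sketch below canonicalizes each WLTP pair over the four 90° rotations, assuming that a 90° rotation circularly shifts the four ternary digits of both WLTPs in a pair by one position (shift((p)3, l) in the notation of Table 1). The function names and the simplified equivalence rule are ours, so the resulting number of classes may differ from the 2222 valid patterns obtained with Table 1:

def shift_ternary(p, l, L=4):
    # Circularly shift the L ternary digits of pattern index p by l positions.
    digits = [(p // 3 ** i) % 3 for i in range(L)]       # least-significant digit first
    rolled = [digits[(i - l) % L] for i in range(L)]
    return sum(d * 3 ** i for i, d in enumerate(rolled))

def build_mapping_table(n_patterns=81, L=4):
    # Map every ordered WLTP pair (p, q) to a rotation-equivalent index by taking the
    # smallest pair over the four rotations as the canonical representative (a sketch
    # of the idea behind Table 1, not a reproduction of it).
    table, next_index = {}, 0
    for p in range(n_patterns):
        for q in range(n_patterns):
            canon = min((shift_ternary(p, l, L), shift_ternary(q, l, L)) for l in range(L))
            if canon not in table:
                table[canon] = next_index
                next_index += 1
            table[(p, q)] = table[canon]
    return table                                         # dict: (p, q) -> equivalence index

With such a table, evaluating Eq. (7) amounts to looking up M(p, q) for every co-occurring WLTP pair and accumulating a histogram over the equivalence-class indexes.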

3.4 Dimension analysis of LBP/LTP-based statistics

In the general LBP, the neighborhood pixel number L is usually set to 8, which results in 256 (\(2^{8}\): 8 binary digits) LBP indexes, and the possible pattern value for any 3×3 local patch ranges over [0, 255]. Thus, the LBP histogram used for image representation has a dimension of 256. Directly integrating context by considering the index combinations of adjacent LBP pairs would intuitively lead to a 256×256 co-occurrence matrix, where each element denotes the number of LBP pairs in the image with that index combination, and the reshaped vector of the co-occurrence matrix used as the image representation has 65,536 (256²) dimensions. This high dimensionality results in a high computational cost for the subsequent classification procedure. Therefore, Nosaka et al. [17, 18] adopted only four neighborhood pixels (L=4), producing 16 (\(2^{4}\)) LBP indexes, and then explored the co-occurrence statistics of LBP pairs (Co-LBP) and the rotation invariant co-occurrence (RICLBP), which generate very compact LBP-based descriptors of 256 (\((2^{4})^{2}\)) and 136 dimensions, respectively. On the other hand, although Qi et al. [19] set the neighborhood pixel number L to 8, not all 256 LBP indexes are used for exploring the co-occurrence of LBP pairs, in order to avoid a high-dimensional feature. In [19], the authors merge the LBP indexes that appear with low frequency into one group to retain a small number of LBP indexes (uniform LBP: ULBP, 59 patterns) and further produce the rotation invariant uniform LBP (RIU-LBP, only 10 patterns) as the basic pair patterns. The co-occurrence matrix is formed by counting the number of adjacent ULBP and RIU-LBP pairs with each index combination, which produces a matrix of size 59×10. The reshaped vector used as the image representation is 590-dimensional, and [19] further integrates two different configurations of pair combinations, which generates a 1180-dimensional feature.

In the (W)LTP-based descriptors, if the neighborhood pixel number L is set to 8, the number of LTP indexes is 6561 (\(3^{8}\): 8 ternary digits). In addition, context integration considering the index combinations of adjacent (W)LTP pairs would generate an extremely large co-occurrence matrix (6561×6561), which results in a 43,046,721-dimensional feature and is infeasible for post-processing. Thus, this study uses four neighborhood pixels for the (W)LTP-based descriptors. For the simple histogram of LTP and WLTP indexes (\(3^{4}\): 4 ternary digits), an 81-dimensional image feature is obtained, and the co-occurrence statistics (called CoLTP and CoWLTP) have a dimension of 6561 (81²). After taking the orientation-invariant property into account, our proposed RICLTP and RICWLTP produce a 2222-dimensional feature for image representation. Basically, the computational cost of extracting the LBP- and LTP/WLTP-based descriptors mainly depends on the numbers of pattern and pair-pattern indexes, and thus high-dimensional (W)LTP-based descriptors would lead to high computational cost. This is another reason that this study uses only four neighborhood pixels; exploring different neighborhood structures to integrate more context information and multiscale patterns into our proposed RICWLTP is left to future work.
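The dimension bookkeeping discussed above can be summarized in a few lines of arithmetic (values only; no implementation detail is implied):

# Descriptor dimensions discussed in this subsection
lbp_patterns  = 2 ** 8             # 256 LBP indexes (L = 8, binary)
colbp_dim     = (2 ** 4) ** 2      # 256-dimensional Co-LBP of Nosaka et al. (L = 4)
wltp_patterns = 3 ** 4             # 81 (W)LTP indexes (L = 4, ternary)
cowltp_dim    = wltp_patterns ** 2 # 6561-dimensional CoLTP / CoWLTP
ltp8_cooc_dim = (3 ** 8) ** 2      # 43,046,721 dimensions if L = 8 -- infeasible
ricwltp_dim   = 2222               # after rotation-invariant grouping via Table 1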

4 Experiments

4.1 Data sets and methodology

We evaluated our proposed framework on two texture datasets and one food image dataset.

(i) Two texture datasets: the first is Brodatz32 [33], which is a standard dataset for texture recognition, and the second is KTH-TIPS 2a [34], a dataset for material categorization [35]. In the Brodatz32 dataset, three additional images are generated from each original sample by (i) rotating, (ii) scaling, and (iii) both rotating and scaling it. We use the standard protocol [36] of randomly splitting the dataset into two halves for training and testing and report the average performance over 10 random splits. The KTH-TIPS 2a dataset contains 11 materials, e.g., cork, wool, and linen, with images of four physical, planar samples per material. The samples were photographed at nine scales, three poses, and four different illumination conditions. It consists of 4395 images, most of which have a size of 200×200 pixels. All these variations in scale, pose, and illumination make it an extremely challenging dataset. For the KTH-TIPS 2a texture dataset, we use the same evaluation protocol as [36, 37] and report the average performance over four runs, where each time all images of one sample are used for testing while the images of the remaining three samples are used for training.

(ii) Food dataset: the Pittsburgh fast-food image dataset (PFID) [38] is a collection of fast food images and videos from 13 chain restaurants acquired under laboratory and realistic settings. In our experiments, we focus on the set of 61 categories of specific food items (e.g., McDonald’s Big Mac) with masked background. Each food category contains three different instances of the food (bought on different days from different branches of the restaurant chain) and six viewpoint images (60° apart) of each food instance. We follow the experimental protocol of the published work [39] and perform threefold cross-validation, using the 12 images from two instances for training and the 6 images from the third instance for testing. This procedure is repeated three times with a different instance serving as the test set each time, and the average performance is reported. The protocol ensures that no image of any given food item ever appears in both the training and test sets and guarantees that the food items were acquired from different restaurants on different days. Two standard baseline algorithms, color histogram + SVM and bag of features with SIFT local descriptors + SVM, are used for comparative evaluation. In the first baseline, a standard RGB color histogram with four quantization levels per color component is extracted to generate a 4³ = 64-dimensional representation of a food image, and a multi-class SVM is applied for classification. In the second, the BoF strategy combined with an SVM classifier uses the histogram (low-order statistics) of visual words (representative words) computed from SIFT local descriptors. Recently, Yang et al. proposed statistics of pairwise local features (SPLF) [39] for food image representation, and their evaluation on the PFID dataset showed the best performance among state-of-the-art methods; these results are also used for comparison with our strategy.

4.2 Experimental results

We investigate the recognition performance using the statistics of our proposed WLTP, its co-occurrence (i.e., spatial and orientation) context integration, the corresponding versions without Weber’s law (i.e., without data-driven quantization, denoted as LTP and RICLTP), and the conventional LBP and its extensions by Nosaka et al. [17, 18] and Qi et al. [19], which also incorporate co-occurrence context into LBP. After obtaining the LBP/LTP-based features for image representation, we simply use a linear support vector machine (SVM) for classification because of its efficiency compared to the nonlinear one. In addition, we also pre-process the LBP/LTP-based histogram with a square root operation as follows:

$$ \begin{aligned} \mathbf{P}'(\mathbf{I})&= \left[p_{1}'(\mathbf{I}),p_{2}'(\mathbf{I}),\cdots,p_{L}'(\mathbf{I})\right]\\ &=\left[\sqrt{p_{1}(\mathbf{I})},\sqrt{p_{2}(\mathbf{I})},\cdots,\sqrt{p_{L}(\mathbf{I})}\right] \end{aligned} $$
(8)

where \([p_{1}(\mathbf{I}), p_{2}(\mathbf{I}),\cdots,p_{L}(\mathbf{I})]=\mathbf{P}(\mathbf{I})\) is the raw histogram (LBP/LTP/WLTP, the context-integrated co-occurrence LBP features of Nosaka et al. and Qi et al., or our proposed features) and L is the dimension of the feature in question. \(\mathbf{P}'(\mathbf{I})\) is the pre-processed (normalized) histogram, which is used as the input to the SVM for classification. With the above normalization, we can enhance local patterns that have a low absolute frequency (histogram value) in an image but a large relative difference when two images are compared. Furthermore, a linear SVM on the processed features can also be interpreted as a nonlinear classification strategy on the raw LBP/LTP histogram, which is equivalent to using the Hellinger kernel in the raw feature space.
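A sketch of this pre-processing and the linear SVM it feeds is given below (scikit-learn is used purely for illustration; the original experiments may rely on a different SVM implementation):

import numpy as np
from sklearn.svm import LinearSVC

def sqrt_normalize(X):
    # Eq. (8): element-wise square root of the raw histograms.  A linear kernel on the
    # square-rooted features equals the Hellinger kernel on the original histograms.
    return np.sqrt(np.asarray(X, dtype=np.float64))

# X_train, X_test: rows are LBP/LTP/WLTP-based histograms; y_train: class labels
# clf = LinearSVC(C=1.0).fit(sqrt_normalize(X_train), y_train)
# predictions = clf.predict(sqrt_normalize(X_test))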

As introduced in Section 3, the (W)LTP descriptors and their context-integrated versions, RIC(W)LTP, in the proposed framework are formulated by a quantization procedure with a predefined (data-driven) threshold η. Figure 3 shows the comparative recognition performance using the features of Nosaka et al. and Qi et al., LBP, the conventional LTP, and the proposed WLTP, RICLTP, and RICWLTP with different η on the Brodatz32 and KTH-TIPS 2a datasets; the LBP-based features (i.e., LBP and the features of Nosaka et al. and Qi et al.) have no such parameter. In our experiments, the parameter η in LTP and RICLTP is set to 1, 3, 5, and 7, and η in WLTP and RICWLTP is set to 0.03, 0.05, 0.07, and 0.09. From Fig. 3, we observe that our proposed data-driven quantized versions (i.e., WLTP and RICWLTP) result in much better performance than the versions with an absolute threshold that ignores the magnitude of the focused pixel (i.e., LTP and RICLTP); the best recognition result is achieved by the proposed framework with data-driven quantization and co-occurrence context. Figure 4 gives the comparative recognition accuracies of our proposed frameworks and other state-of-the-art approaches [35–37, 40–42] on both the Brodatz32 and KTH-TIPS 2a datasets; the best results are achieved by our proposed approach with data-driven quantization and co-occurrence context. We also implemented the texture feature of [36], called the Weber local descriptor (WLD), for image representation and used a linear SVM with the WLD histograms normalized as in Eq. (8) for classification. On both the Brodatz32 and KTH-TIPS 2a datasets, we extracted WLD histograms under different parameters (M=6, T=4, S=3 and M=8, T=8, S=4 as in [36], respectively, without block division for Brodatz32 but with a nine-block division for the KTH-TIPS 2a dataset), denoted as “WLD_Ver1” and “WLD_Ver2,” respectively. Figure 4 shows that our proposed RIC(W)LTP methods give much better performance on both texture datasets than the WLDs implemented by [36] and by us. All the above experiments use a linear SVM with the LBP/LTP-based histograms pre-processed by Eq. (8). As analyzed in [43], a linear SVM on image features pre-processed with the square root operator (Eq. (8)) is equivalent to a nonlinear SVM with the Hellinger kernel [43] on the raw image features. This simple pre-processing combined with a linear SVM can achieve classification performance comparable to a nonlinear SVM on the raw image features, especially for histogram-based representations. For comparison, we also conducted experiments using a nonlinear SVM with an RBF kernel on both texture datasets. The recognition accuracies of RICLTP/RICWLTP with the linear and nonlinear SVMs are compared in Fig. 5 a, b for the Brodatz32 and KTH-TIPS 2a datasets, respectively, which shows that comparable or slightly better performance is achieved by the nonlinear SVM than by the linear one with the pre-processed RICLTP and RICWLTP descriptors. The RBF kernel of the nonlinear SVM has a parameter γ, which is related to the kernel width and may greatly affect classification performance. Based on our experience [44, 45], we set \(\frac {1}{\gamma }\) to the mean of the pairwise Euclidean distances between all training samples’ features in the nonlinear SVM experiments.
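The γ heuristic used for the RBF-kernel experiments can be written as follows (a sketch with scikit-learn’s SVC for illustration; note that its RBF kernel is exp(−γ‖x−y‖²), so the exact way the mean distance enters the kernel should follow the setup of [44, 45]):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.svm import SVC

def rbf_gamma_from_data(X_train):
    # Set 1/gamma to the mean pairwise Euclidean distance of the training features
    # (off-diagonal entries only), as described in the text.
    D = euclidean_distances(X_train)
    mean_dist = D[np.triu_indices_from(D, k=1)].mean()
    return 1.0 / mean_dist

# clf = SVC(kernel="rbf", gamma=rbf_gamma_from_data(X_train), C=1.0).fit(X_train, y_train)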

Fig. 3 Comparative results using our proposed framework and the conventional LBP-based descriptors by Nosaka et al. [17, 18] and Qi et al. [19]. The horizontal axes denote the parameter η introduced in Sections 2 and 3 for LTP, WLTP, RICLTP, and RICWLTP; since the LBP-based descriptors have no such parameter, the recognition accuracies for LBP and the works by Nosaka et al. and Qi et al. remain constant across these graphs. a Brodatz32 dataset. b KTH-TIPS 2a dataset

Fig. 4 Comparative results using our proposed framework and the state-of-the-art methods on both the Brodatz32 [35, 36, 40–42] and KTH-TIPS 2a [36, 37] datasets. “WLD_Ver1 [35]” and “WLD_Ver2 [35]” denote our re-implementations of the WLD features in [36] under different parameters (M=6, T=4, S=3 and M=8, T=8, S=4, respectively). “Chen(WLD)” denotes the recognition accuracy taken directly from [36]. a Brodatz32 dataset. b KTH-TIPS 2a dataset

Fig. 5 Comparative results using the linear SVM with pre-processed RICLTP and RICWLTP descriptors and the nonlinear SVM with an RBF kernel on both texture datasets. a Brodatz32 dataset. b KTH-TIPS 2a dataset

Next, we give the comparative results using our proposed LTP-based descriptors and the conventional LBP-based descriptors (the features of Nosaka et al. and Qi et al.) on the PFID dataset in Fig. 6 a, which also shows the promising recognition performance of our strategy. As in the texture experiments, we set the parameter η in LTP and RICLTP for the food recognition experiments to 1, 3, 5, and 7 and η in WLTP and RICWLTP to 0.03, 0.05, 0.07, and 0.09. For the PFID dataset, the baseline evaluations with the color histogram and the BoF model using SIFT descriptors give recognition rates of 11.2 and 9.3%, respectively. More promising performance on the PFID dataset was achieved in [39] using statistics of pairwise local features. However, those features require first segmenting the different ingredients from the food image, which costs additional computational time. In our work, we use the statistics of a simple data-driven quantization of microstructures for food image representation and apply a very efficient linear SVM as the classifier. The comparison with the state-of-the-art methods is shown in Fig. 6 b, where the best performance is obtained by our proposed strategy. The more complex statistics of pairwise local features (denoted as SPLF) [39] greatly improve on the baseline evaluations (11.2 and 9.3%) of the conventional color histogram (denoted as CH) and the BoF model (denoted as BOF), nearly tripling them (28.2%), which was the best performance on PFID in 2010. Our implementations of WLD histograms with different parameters [36] also give performance comparable to SPLF on the PFID dataset, and our proposed strategy with data-driven quantization and co-occurrence context further increases the recognition rate from 28.2% with SPLF [39] to more than 36%, an improvement of about 8 percentage points.

Fig. 6 Comparative results on the PFID dataset a with the LBP-based descriptors by Nosaka et al. [17, 18] and Qi et al. [19] and b with the state-of-the-art methods [39]

5 Conclusions

In this paper, we explored a robust representation strategy for texture images in visual recognition. The most widely used local descriptor for texture analysis is the local binary pattern, which characterizes each 3×3 local patch (microstructure) as a binary series by comparing the surrounding pixel intensities with the center one; it has further been extended to the local ternary pattern by quantizing the difference values between the surrounding pixels and the center one into −1, 0, and 1, which can be interpreted as activation status (i.e., positively activated, negatively activated, and not activated). However, regardless of the magnitude of the focused pixel, the pre-set quantization threshold η in the conventional LTP remains fixed, which violates the principle of human perception. Motivated by the fact that human perception of a pattern depends not only on the absolute intensity of the stimulus but also on the relative variation of the stimuli, we proposed a novel local ternary pattern with data-driven quantization according to Weber’s law (called WLTP), which is equivalent to adaptively (in a data-driven way) deciding the quantization point according to the magnitude of the focused pixel; Weber’s law, a psychological law, states that the noticeable change of a stimulus such as sound or lighting perceived by a human being is a constant ratio of the original stimulus. Further, we incorporated spatial and orientation context into WLTP and explored the rotation invariant co-occurrence among WLTPs, which has much higher descriptive capability than conventional LBP-based descriptors for image representation. Our experiments on two texture datasets and one food dataset confirmed that our proposed strategy greatly improves recognition performance compared to LBP-based descriptors and other state-of-the-art approaches.