Semantic Texton Forests for Image Categorization and Segmentation

Johnson, M.; Shotton, J.; Cipolla, R.

doi:10.1007/978-1-4471-4929-3_15

M. Johnson³,
J. Shotton⁴ &
R. Cipolla⁵

Part of the book series: Advances in Computer Vision and Pattern Recognition ((ACVPR))

7674 Accesses
4 Citations

Abstract

Semantic texton forests (stfs) are a form of random decision forest that can be employed to produce powerful low-level codewords for computer vision. Each decision tree acts directly on image pixels, resulting in a codebook that bypasses the expensive computation of filter-bank responses or local descriptors. Further, stfs are extremely fast to both train and test, especially when compared with k-means clustering and nearest-neighbor assignment of feature descriptors. The nodes in the stfs provide both an implicit hierarchical clustering into semantic textons, and also an explicit pixel-wise local classification estimate. In this chapter we (i) investigate stfs as learned visual dictionaries; (ii) show how stfs can be used for both image categorization and semantic segmentation by aggregating hierarchical bags of semantic textons; (iii) demonstrate that stfs allow us to exploit semantic context in segmentation; and (iv) show how a global image-level categorization can be used as a prior to improve the accuracy of semantic segmentation. We also see that the efficient tree structures of stfs allow at least a five-fold increase in execution speed over competing techniques.

This work was undertaken while the first two authors were at the University of Cambridge and Toshiba Corporate Research and Development Center respectively.

Access provided by Autonomous University of Puebla. Download chapter PDF

A Powerful and Efficient Method of Image Segmentation Based on Random Forest Algorithm

Semantic Segmentation with Millions of Features: Integrating Multiple Cues in a Combined Random Forest Approach

Fast and Scalable Image Retrieval Using Predictive Clustering Trees

Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Semantic texton forests (stfs) are a form of random decision forest that can be employed to produce powerful low-level codewords for computer vision. Each decision tree acts directly on image pixels, resulting in a codebook that bypasses the expensive computation of filter-bank responses or local descriptors. Further, stfs are extremely fast to both train and test, especially when compared with k-means clustering and nearest-neighbor assignment of feature descriptors. The nodes in the stfs provide both an implicit hierarchical clustering into semantic textons, and also an explicit pixel-wise local classification estimate. In this chapter we (i) investigate stfs as learned visual dictionaries; (ii) show how stfs can be used for both image categorization and semantic segmentation by aggregating hierarchical bags of semantic textons; (iii) demonstrate that stfs allow us to exploit semantic context in segmentation; and (iv) show how a global image-level categorization can be used as a prior to improve the accuracy of semantic segmentation. We also see that the efficient tree structures of stfs allow at least a five-fold increase in execution speed over competing techniques.

1 Introduction

This chapter discusses semantic texton forests, and demonstrates their use for image categorization and semantic segmentation; see Fig. 15.1. Our aim is to show that one can build powerful texton codebooks without computing expensive filter-banks or descriptors, and without performing costly k-means clustering and nearest-neighbor assignment. Semantic texton forests (stfs) achieve both these goals. stfs are randomized decision forests that use only simple pixel comparisons on local image patches, and output both an implicit hierarchical clustering into semantic textons and an explicit local classification of the patch category.

We look at two applications of stfs: image categorization (inferring the object categories present in an image) and semantic segmentation (dividing the image into coherent regions and simultaneously categorizing each region). To these ends, we propose the bag of semantic textons (bost). The bost is computed over a given image region, and extends the bag of words model [81] by combining a histogram of the hierarchical semantic textons with a region prior category distribution. For categorization, we obtain a highly discriminative descriptor by considering the image as a whole. For segmentation, we use many local rectangular regions and build a second randomized decision forest that achieves efficient and accurate segmentation.

Inferring the correct segmentation depends on local image information that can often be ambiguous. The global statistics of the image, however, are more discriminative and may be sufficient to accurately estimate the image categorization. We therefore investigate how an SVM-based image categorization can act as an image-level prior to improve segmentation: the classification output of the SVM is used as a prior to emphasizing the categories most likely to be present given the global appearance of the image.

To summarize, the main topics in this chapter are: (i) semantic texton forests which efficiently provide both a hierarchical clustering into semantic textons and a local classification; (ii) the bag of semantic textons model, and its applications in categorization and segmentation; (iii) how stfs allow us to exploit semantic context for segmentation; and (iv) the use of the image-level prior to improve segmentation accuracy.

1.1 Related Work

Textons [176, 229, 381] and visual words [348] have proven powerful discrete image representations for categorization and segmentation [81, 342, 404, 416]. Filter-bank responses (derivatives of Gaussians, wavelets, etc.) or invariant descriptors (e.g. SIFT [225]) are computed across a training set, either at sparse interest points (e.g. [248]) or more densely; results in [267] suggest that densely sampling visual words improves categorization accuracy. The collection of descriptors are then clustered to produce a codebook of visual words, typically with the simple but effective k-means, followed by nearest-neighbor assignment. Unfortunately, this three stage process is extremely slow and often the most time consuming part of the whole system, even with optimizations such as kd-trees, the triangle inequality [97], or hierarchical clusters [266, 321].

The work of Moosmann et al. [253] proposed a more efficient alternative, in which training examples are recursively divided using a randomized decision forest [5, 44, 128] and where the splits in the decision trees are comparisons of a descriptor dimension to a threshold. With semantic texton forests, we extend [253] in three ways: (i) we learn a codebook that acts directly on image pixels, bypassing the expensive step of computing image descriptors; (ii) while [253] use the learned decision forest only for clustering, we also use it as a classifier, which enables us to use semantic context for image segmentation; and (iii) in addition to the leaf nodes used in [253], we include the split nodes as hierarchical clusters. A related method, the pyramid match kernel (PMK) [141], exploits a hierarchy in descriptor space, though the PMK requires the computation of feature descriptors and is primarily applicable only to kernel-based classifiers. The pixel-based features we use are similar to those in [214], but our forests are trained to recognize object categories, not to match particular feature points.

Other work has also looked at alternatives to k-means. The work of [376] quantized feature space into a hyper-grid, but required descriptor computation and can result in very large visual word codebooks. Winder and Brown [402] learned the parameters of generic image descriptors for 3D matching, though did not address visual word clustering. Jurie and Triggs [177] proposed building codebooks using mean shift, but did not incorporate semantic supervision in the codebook generation.

2 Randomized Decision Forests

We begin with a brief review of randomized classification forests [5, 128]. We follow the notation and terminology introduced in Chaps. 3 and 4 as closely as possible. A decision forest is an ensemble of T decision trees. Associated with each node j in the tree is a learned class distribution p _j(c). A decision tree works by recursively branching left or right down the tree according to a series of learned binary functions computed at 2D pixel position u=(u _x,u _y), until a leaf node l is reached. The whole forest achieves an accurate and robust classification by averaging the class distributions over the leaf nodes $\mathcal{L}(\mathbf{u}) = \{ l_{t}(\mathbf{u}) \}_{t=1}^{T}$ reached for all T trees:

$$ p(c | \mathbf{u}) = \frac{1}{T}\sum_{l \in\mathcal{L}(\mathbf {u})} p_l(c). $$

(15.1)

Existing work has shown the power of decision forests as either classifiers [36, 214] or a fast means of clustering descriptors [253]. In this chapter we show how to simultaneously exploit both classification and clustering. Furthermore, we generalize [253] to use the tree hierarchies as hierarchical clusters.

We use the standard randomized learning algorithm described in Chap. 4 to learn binary forests. Each tree is trained separately on a small random subset $\mathcal{S}' \subseteq \mathcal{S}$ of the training data $\mathcal{S}$ (here we employ the bagging randomness model). We will denote the weak learner decision function as

$$ h(\mathbf{u}; \boldsymbol{\theta}_j) = \bigl[ f(\mathbf {u}; \boldsymbol{\phi}_j) \ge\tau_j \bigr] $$

(15.2)

which is governed by node-specific parameters θ _j=(ϕ _j,τ _j) consisting of offset parameters ϕ (see below) and a threshold τ. Learning proceeds as described in Chap. 4, using the standard entropy-based information gain objective. The training continues to a maximum depth D or until no further information gain is possible. The class distributions p _j(c) are estimated empirically as a histogram of the class labels c(u) of the training examples $\mathbf{u}\in\mathcal {S}_{j}$ that reached node j.

The amount of training data may be significantly biased towards certain classes in some datasets. A classifier learned on these data will have a corresponding prior preference for those classes. We weight each training example by the inverse class frequency as w(u)=ξ _c(u) with $\xi_{c} = ( \sum_{\mathbf{u}\in \mathcal{S}} [c = c(\mathbf{u})] )^{-1}$. This weight is applied to each example when accumulating the histograms used to compute the information gain. The classifiers trained using this weighting tend to give a better class average accuracy.

Using ensembles of trees trained on only small random subsets of the data helps to speed up training time and reduce over-fitting [5]. The trees are fast to learn and extremely fast to evaluate since only a small portion of the tree is traversed for each data point. After training, an improved estimate of the class distributions is obtained using all pixels in the training data $\mathbf{u}\in\mathcal{S}$, not just the subset $\mathcal{S}'$. We found this to improve the generalization of the classifiers slightly, especially for classes with few training examples.

3 Semantic Texton Forests

Semantic texton forests (stfs) are a specific form of randomized decision forests that can be used for both clustering and classification. The features f(u;ϕ) in stfs act on small image patches centered at pixel u of size Δ×Δ pixels, as illustrated in Fig. 15.2(a). The feature parameters ϕ denote one of the following functions: (i) the value v(u+δ,b) of a single pixel at u+δ in color channel b; (ii) the sum v(u+δ ₁,b ₁)+v(u+δ ₂,b ₂); (iii) the difference v(u+δ ₁,b ₁)−v(u+δ ₂,b ₂); or (iv) the absolute difference |v(u+δ ₁,b ₁)−v(u+δ ₂,b ₂)|. Here, function v denotes a look-up into the image pixel colors. The color channels b ₁ and b ₂ need not be the same.

To textonize an image, the Δ×Δ patch centered at each pixel u is passed down the stf resulting in semantic texton leaf nodes $\mathcal{L} = \{ l_{t} \}_{t=1}^{T}$ and the averaged class distribution p(c|u). Examples are shown in Fig. 15.1 and Fig. 15.3(b). A pixel-level classification based on the local distributions $P(c | \mathcal{L})$ gives poor but still surprisingly good accuracy (see Sect. 15.6.1). We will shortly describe in Sect. 15.3.3 how the bag of semantic textons can pool the statistics of semantic textons $\mathcal{L}$ and distributions $P(c|\mathcal{L})$ over an image region to form a much more powerful feature for image categorization and semantic segmentation.

Examples of the appearance clusters learned in stfs are given in Fig. 15.2(b).

3.1 Learning Invariances

Although using raw pixels as features is much faster than first computing descriptors or filter-bank responses, one risks losing their inherent invariances. To avoid this loss, we augment the training data with image copies that are artificially transformed geometrically and photometrically [214]. This augmentation allows one to learn the right degree of invariance required by suitably designing these transformations for a particular problem. In our experiments we explored small rotations and scalings, and left-right flipping as geometric transformations, and affine photometric transformations.

3.2 Implementation Details

As discussed in Part I of this book, forests can be trained both in a supervised or unsupervised manner. Similarly, an stf can be trained using (i) pixel-level supervision, (ii) weak supervision, in which the members of the set of classes present in the whole image are used as training labels for all pixels, or (iii) no supervision, where the split function that most evenly divides the data is used. In the unsupervised case, the stf forest acts only as a hierarchical clusterer, not a classifier, similar to the density forests of Chap. 6. We examine the effect of different levels of supervision in Sect. 15.6.

We found the CIELab color space to generalize better than RGB, and it is used in all experiments. Training example pixels are taken on a regular grid (every 5×5 pixels) in the training images, excluding a narrow band of $\frac{\varDelta }{2}$ pixels around the image border to avoid artifacts; at test time, the image is extended to ensure a smooth estimate of the semantic textons near the border.

3.3 Bags of Semantic Textons

A popular and powerful method for categorizing images and detecting objects is the bag of words model [81, 348, 416]. A histogram of visual words is created over the whole image or a region of interest [67], either discarding spatial layout or using a spatial hierarchy [204]. The histogram is used as input to a classifier to recognize object categories. We propose the localized bag of semantic textons (bost), illustrated in Fig. 15.4. This extends the bag of words to be hierarchical and to include low-level semantic information, as follows.

Given the leaf nodes $\mathcal{L}(\mathbf{u}) = \{ l_{t} \}_{t=1}^{T}$ and the inferred class distribution p(c|u) for each pixel u, one can compute the following over an image region F: (i) a non-normalized histogram G _F(j) that concatenates the occurrences of tree nodes j across the different trees [253]; and (ii) a prior over the region given by the average class distribution p(c|F)=∑_u∈F p(c|u). In contrast to [253], we include both leaf nodes l and split nodes j in the histogram, noting that G _F(j)=∑_{j′∈child(j)} G _F(j′). The histogram therefore uses the hierarchy of clusters implicit in each tree. Each p(c|u) is already averaged across trees, and hence there is a single region prior p(c|F) for the whole forest.

Our results in Sect. 15.6 show that the histograms and region priors are complementary, and that the hierarchical clusters are better than the leaf node clusters alone. For categorization (Sect. 15.4), we use bosts where the region is the whole image. For segmentation (Sect. 15.5), we use a combination of bosts over many local rectangular regions to model layout and context.

Implementation Details

The counts of tree root nodes hold no useful information and are not included in the histograms. The histograms are sparse near the leaves, and can be stored efficiently since the histogram counts at the parent split node can be quickly computed on-the-fly. If the region F is rectangular, the histograms and class distributions can be calculated very efficiently using integral histograms [295, 342].

4 Image Categorization

The task of image categorization is to determine those categories (e.g. dog images, beach images, indoor images) to which an image belongs. For our purposes, every image belongs to those categories for which there exists a pixel in the image that has been labeled with that category. Thus, an image with a sheep eating grass will belong to both the ‘grass’ and ‘sheep’ categories. Example previous approaches have used global image information [276], bags of words [104] or textons [404].

We propose an image categorization algorithm that exploits the hierarchy of semantic textons and the node prior distributions p _j(c). This algorithm uses a non-linear support vector machine (SVM), though of course decision forests could also be used instead. The SVM depends on a kernel function K that defines the similarity measure between images. To take advantage of the hierarchy in the stf, we adapt the pyramid match kernel [141] to act on a pair of bost histograms computed across the whole image.

Consider first the bost histogram computed for just one tree in the stf. The kernel function (based on [141]) is then

$$ K (P, Q) = \frac{1}{\sqrt{Z}} \tilde{K} (P, Q), $$

(15.3)

where Z is a normalization term for images of different sizes computed as

$$ Z = \tilde{K}(P, P)\tilde{K}(Q, Q) . $$

(15.4)

Here, $\tilde{K}$ is the actual matching function, computed over levels of the tree as

$$ \tilde{K}(P, Q) = \sum_{d=1}^D \frac{1}{2^{D - d + 1}} (\mathcal {G}_d - \mathcal{G}_{d+1}), $$

(15.5)

using the histogram intersection $\mathcal{G}$

$$ \mathcal{G}_d = \sum_j \min \bigl(P_d [j], Q_d [j]\bigr). $$

(15.6)

In the above, D is the maximum depth of the tree, P and Q are the hierarchical histograms, and P _d and Q _d are the portions of the histograms at depth d, with j indexing over all nodes at depth d. There are no nodes at depth D+1, hence $\mathcal{G}_{D+1} = 0$. If the tree is not full depth, missing nodes j are simply assigned P _d[j]=Q _d[j]=0.

The kernel over all trees in the stf is calculated as K=∑_t γ _t K _t with mixture weights γ _t. Similarly to [416], we found $\gamma_{t} = \frac{1}{T}$ to result in the best categorization results. While effective, this kernel can be improved by using the learned ‘prior’ distributions p _j(c) from the stf. We build a 1-vs.-all SVM kernel K _c per category, in which the count for node j in the bost histogram is weighted by the value p _j(c).^{Footnote 1} This weighting helps balance the categories, by selectively down-weighting those that cover large image areas (e.g. grass, water) and thus have inappropriately strong influence on the pyramid match, masking the signal of smaller classes (e.g. cat, bird).

In Sect. 15.6.2, we show the improvement that the pyramid match kernel on the hierarchy of semantic textons gives over a radial basis function on histograms of just leaf nodes. We also obtain an improvement using the per-category kernels K _c instead of a global kernel K. Finally, we show how this categorization can act as an image-level prior for segmentation in Sect. 15.5.1.

5 Semantic Segmentation

To demonstrate the power of the bosts as features for segmentation, we adapt the TextonBoost algorithm [342]. The goal is to segment an image into coherent regions and simultaneously infer the class label of each region (see Sect. 15.6.3.1).

Appearance Context vs. Semantic Context

In [342], a boosting algorithm selected features based on localized counts of textons to model patterns of texture, layout, and context. The context modeled in [342] was appearance-based, for example: sheep often stand on something green. We adapt the rectangle count features of [342] to act on both the semantic texton histograms and the bost region priors. The addition of region priors allows us to model context based on semantics [303], not just texture. Continuing the example, our model can capture the notion that sheep often stand on grass. This concept of basing the output of one classifier as the input to another was proposed concurrently by [341] (the original version of this chapter) and [375]. The related idea of entanglement is explored in Chap. 19.

The segmentation algorithm works as follows. For speed we use a second classification forest in place of the boosting classifier used by [342]. We train this forest to act at image pixels u, using pixels on a regular grid as training examples. At test time, the segmentation forest is applied at each pixel u densely or, for more speed, on a grid. The most likely class in the averaged category distribution (15.1) gives the final segmentation for each pixel. The split node functions f now compute either the count G _F+u(j) of semantic texton j, or the probability p(c∣F+u) of class c, within rectangle F translated relative to pixel u. By translating rectangle F relative to the pixel u being classified, and uniformly sampling rectangles F within a box offset from u by up to half the image size, such features can exploit texture, layout and context information (see [342] for more details). Our extension to these features exploits semantic context by using the region prior probabilities p(c|F+u) inferred by the semantic textons. We show the benefit this brings in Sect. 15.6.3.

5.1 Image-Level Prior

We could embed the above segmentation forest in a conditional random field model to achieve more coherent results or to refine the grid segmentation to a per-pixel segmentation [157, 342]. Instead, we decided to investigate a simpler and more efficient approach using the image categorizer we built in Sect. 15.4. For each test image we separately run the categorization and segmentation algorithms. This gives an image-level prior (ilp) distribution p(c) and a per-pixel segmentation distribution p(c|u) respectively. We use the ilp to emphasize the likely categories and discourage unlikely categories, by multiplying the somewhat independent distributions as p′(c|u)=p(c|u)p(c)^α, using parameter α to soften the prior. We show in Sect. 15.6.3 and Sect. 15.6.3.1 how the addition of the ilp gives a considerable improvement to the resulting segmentations. Li and Fei-Fei [219] proposed a related idea that uses scene categorization as priors for object detection.

6 Experiments

We performed experiments on the following two datasets:

	# classes	# training images	# test images
MSRC [342]	21	276	256
VOC 2007 (Seg) [99]	21	422	210

We use the standard train/test splits where available, and the hand-labeled ground truth to train the classifiers. Image categorization accuracy is measured as mean average precision [99]. Segmentation accuracy is measured as the category average accuracy (the average proportion of pixels correct in each category). We also report the global accuracy (total proportion of pixels correct), but note that the category average is fairer and more rigorous as it normalizes for category bias in the test set. Training and test times are reported using an unoptimized C# implementation on a single 2.7 GHz core.

6.1 Learning the Semantic Texton Forest Vocabulary

Before presenting in-depth results for categorization and segmentation, let us look briefly at the stfs themselves. In Figs. 15.1 and 15.3 we visualize the inferred leaf nodes $\mathcal{L}(\mathbf{u})$ for each pixel u and the most likely category $c^{\star}(\mathbf{u}) = \mathop {\mathrm {arg\,max}}_{c} p(c | \mathbf{u})$. Observe that the textons in each tree capture different aspects of the underlying texture and that even at such a low level the distribution p(c|u) contains significant semantic information. Table 15.1 gives a naïve segmentation baseline on the MSRC dataset by comparing c ^⋆(u) to the ground truth, with either fully or weakly supervised training pixels (see Sect. 15.3.2).

Table 15.1 Naïve segmentation baseline on MSRC

Full size table

Clearly, this segmentation is poor, especially when trained in a weakly supervised manner, since only very local appearance (and no context) is used. Even so, the signal is remarkably strong for such simple features (random chance is under 5 %). We show below how using semantic textons as features in higher-level classifiers greatly improves these numbers, even with weakly supervised or unsupervised stfs.

Except where otherwise stated, we used stfs with the following parameters, hand-optimized on the MSRC validation set: distance Δ=21, T=5 trees, maximum depth D=10, 500 feature tests and 10 threshold tests per split node, and bagging using $\frac{1}{4}$ of the data per tree, resulting in approximately 500 leaves per tree. Training the stf on the MSRC dataset took only 15 minutes.

6.2 Image Categorization

We performed an experiment on the MSRC data to investigate our SVM categorization algorithm. The mean average precisions (AP) in Table 15.2 compare our modified pyramid match kernel (PMK) to a radial basis function (RBF) kernel, and compare the global kernel K to the per-category kernels K _c. In the baseline results with the RBF kernel, only the leaf nodes of the stf are used, separately per tree, using term frequency/inverse document (‘tf/idf’, from standard information retrieval) frequency to normalize the histogram. The PMK results use the entire bost which for the per-category kernels K _c are weighted by the prior node distributions p _j(c). As can be seen, the pyramid match kernel considerably improves on the RBF kernel. By training a per-category kernel, a small but noticeable improvement is obtained. For the image-level prior for segmentation, we thus use the PMK with per-category kernels. In Fig. 15.5 we plot the global kernel accuracy against the number T of stf trees, and see that categorization accuracy increases with more trees though it eventually levels out.

Table 15.2 Image categorization results. (Mean AP)

Full size table

Further categorization experiments are provided in Tables 15.3 and 15.4.

Table 15.3 MSRC categorization results. The values shown are the average precision (AP) and the area under the ROC curve (AuC) for the MSRC21 dataset

Full size table

Table 15.4 VOC2007 categorization results. The values shown are the average precision (AP) and the area under the ROC curve (AuC) for the VOC2007 segmentation dataset

Full size table

6.3 Semantic Segmentation

6.3.1 Experiments on the MSRC Dataset [342]

We first examine the influence of different aspects of our system on segmentation accuracy. We trained segmentation forests using (a) the histogram G _F(l) of just leaf nodes l; (b) the histogram G _F(j) of all tree nodes j; (c) just the region priors p(c|F); (d) the full model using all nodes and region priors; (e) the full model trained without random geometric/photometric transformations; (f) all nodes using an unsupervised stf (no region priors are available); and (g) all nodes using a weakly supervised stf (only image labels). The category average accuracies are given in Table 15.5 with and without the image-level prior (ilp).

Table 15.5 Comparative segmentation results on MSRC

Full size table

There are several conclusions to draw. (1) In all cases the ilp improves results. (2) The hierarchy of clusters in the stf gives a noticeable improvement. (3) The region priors alone perform remarkably well. Comparing to the segmentation result using only the stf leaf distributions (34.5 %) this shows the power of the localized bosts that exploit semantic context. (4) Each aspect of the bost adds to the model. While, without the ilp, score (b) is slightly better than the full model (d), adding in the ilp shows how the region priors and textons work together.^{Footnote 2} (5) Random transformations of the training images improve accuracy by adding invariance. (6) Accuracy increases with more supervision, but even unsupervised stfs allow good segmentations.

Given this insight, we compare against [342] and [384]. We use the same train/test split as [342] (though not [384]). The results are summarized in Fig. 15.6 with further examples given in Fig. 15.9. Across the whole challenging dataset, using the full model with ilp achieved a class average accuracy of 66.9 %, a considerable improvement on both the 57.7 % of [342] and the 64 % of [384]. The global accuracy also improves slightly on [342]. The image-level prior improves accuracy for all but three classes, but even without it, results are still highly competitive with respect to other methods. Our use of balanced training has resulted in more consistent accuracy across classes, and significant improvements for certain difficult classes: cow, sheep, bird, chair, and cat. We do not use a Markov or conditional random field, which would likely further improve our accuracy [342].

These results used our learned and extremely fast stfs, without needing any hand-designed filter-banks or descriptors that are potentially slow to compute. Extracting the semantic textons at every pixel takes an average of only 275 milliseconds per image, categorization takes 190 ms, and evaluating the segmentation forest only 140 ms. For comparison [342] took over 6 seconds per test image, and [384] took an average of over 2 seconds per image for feature extraction and between 0.3 to 2 seconds for estimating the segmentation. Our algorithm is well over 5 times faster and improves quantitative results. Minor optimizations have subsequently led to a real-time system that runs at over 8 frames per second.

6.3.2 Experiments on the VOC 2007 Segmentation Dataset [99]

This dataset contains 21 challenging categories including background. We trained a stf, a segmentation forest, and an ilp on these data, using the ‘trainval’ split and keeping parameters as for MSRC. The results in Fig. 15.7 compare with [99]. Our algorithm performs over twice as well as the only segmentation entry (Brookes), and the addition of the ilp further improves accuracy by 4 %. The actual winner of the segmentation challenge, the TKK algorithm, used segmentation by detection that fills in the detected object bounding boxes by category. To see if our algorithm could use a detection-level prior dlp (identical to the ilp but using the detected bounding boxes and varying with image position) we took the TKK entry output as the dlp. Our algorithm gave a 12 % improvement over the TKK segmentation by detection, highlighting the power of stfs as features for segmentation.

Since the original publication of this work [341], there has been considerable progress in the field. Please see the latest PASCAL VOC challenge for the state of the art.

7 Conclusions

This chapter presented semantic texton forests as an efficient method for encoding textons. stfs do not depend on local descriptors or expensive k-means clustering, and when supervised during training they can infer a distribution over categories at each pixel. We showed how bags of semantic textons enabled state-of-the-art accuracy on challenging datasets for image categorization and semantic segmentation, and how the use of an inferred image-level prior significantly improves segmentation results. The substantial gains of our method over traditional textons are training and testing efficiency and improved quantitative performance.

The main limitation of our system is the large dimensionality of the bag of semantic textons, which necessitates a trade-off between the memory usage of the semantic texton integral images and the training time if they are computed at runtime. However, using just the region priors is more memory efficient.

As future work, it would be interesting to investigate how stfs might be used for image reconstruction. A few examples of a simple experiment are given in Fig. 15.8.

Notes

1.
At training time, we compute and store the distributions p _j(c) for all nodes j in the tree, not just for leaf nodes.
2.
This effect may be due to segmentation forest (b) being over-confident: looking at the five most likely classes inferred for each pixel, (b) achieves 87.6 % while (d) achieves a better 88.0 %.

References

Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput 9(7)
Google Scholar
Bosch A, Zisermann A, Muñoz X (2007) Image classification using random forests and ferns. In: Proc IEEE intl conf on computer vision (ICCV)
Google Scholar
Breiman L (2001) Random forests. Mach Learn 45(1)
Google Scholar
Chum O, Zisserman A (2007) An exemplar model for learning object classes. In: Proc IEEE conf computer vision and pattern recognition (CVPR)
Google Scholar
Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: ECCV intl workshop on statistical learning in computer vision
Google Scholar
Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proc intl conf on machine learning (ICML)
Google Scholar
Everingham M, van Gool L, Williams C, Winn J, Zisserman A (2007) The Pascal visual object classes (VOC) challenge. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: Proc IEEE conf computer vision and pattern recognition (CVPR)
Google Scholar
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 36(1)
Google Scholar
Grauman K, Darrell T (2005) The pyramid match kernel: discriminative classification with sets of image features. In: Proc IEEE intl conf on computer vision (ICCV)
Google Scholar
He X, Zemel RS, Carreira-Perpiñán MÁ (2004) Multiscale conditional random fields for image labeling. In: Proc IEEE conf computer vision and pattern recognition (CVPR), June 2004, vol 2
Google Scholar
Julesz B (1981) Textons, the elements of texture perception, and their interactions. Nature 290(5802)
Google Scholar
Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: Proc IEEE intl conf on computer vision (ICCV), vol 1
Google Scholar
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proc IEEE conf computer vision and pattern recognition (CVPR)
Google Scholar
Lepetit V, Lagger P, Fua P (2005) Randomized trees for real-time keypoint recognition. In: Proc IEEE conf computer vision and pattern recognition (CVPR)
Google Scholar
Li L-J, Fei-Fei L (2007) What, where and who? Classifying events by scene and object recognition. In: Proc IEEE intl conf on computer vision (ICCV)
Google Scholar
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2)
Google Scholar
Malik J, Belongie S, Leung T, Shi J (2001) Contour and texture analysis for image segmentation. Int J Comput Vis 43(1)
Google Scholar
Mikolajczyk K, Schmid C (2004) Scale and affine invariant interest point detectors. Int J Comput Vis 60(1)
Google Scholar
Moosmann F, Triggs B, Jurie F (2006) Fast discriminative visual codebooks using randomized clustering forests. In: Advances in neural information processing systems (NIPS)
Google Scholar
Nistér D, Stewénius H (2006) Scalable recognition with a vocabulary tree. In: Proc IEEE conf computer vision and pattern recognition (CVPR)
Google Scholar
Nowak E, Jurie F, Triggs B (2006) Sampling strategies for bag-of-features image classification. In: Proc European conf on computer vision (ECCV). Springer, Berlin
Google Scholar
Oliva A, Torralba A (2006) Building the gist of a scene: the role of global image features in recognition. Vis Percept Prog Brain Res 155(1)
Google Scholar
Porikli FM (2005) Integral histogram: a fast way to extract histograms in Cartesian spaces. In: Proc IEEE conf computer vision and pattern recognition (CVPR), vol 1
Google Scholar
Rabinovich A, Vedaldi A, Galleguillos C, Wiewiora E, Belongie S (2007) Objects in context. In: Proc IEEE intl conf on computer vision (ICCV)
Google Scholar
Schindler G, Brown M, Szeliski R (2007) City-scale location recognition. In: Proc IEEE conf computer vision and pattern recognition (CVPR), Minneapolis, June 2007
Google Scholar
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proc IEEE conf computer vision and pattern recognition (CVPR)
Google Scholar
Shotton J, Winn JM, Rother C, Criminisi A (2009) TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context. Int J Comput Vis 81(1)
Google Scholar
Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proc IEEE intl conf on computer vision (ICCV)
Google Scholar
Tu Z, Bai X (2010) Auto-context and its application to high-level vision tasks and 3D brain image segmentation. IEEE Trans Pattern Anal Mach Intell 32(10)
Google Scholar
Tuytelaars T, Schmid C (2007) Vector quantizing feature space with a regular lattice. In: Proc IEEE intl conf on computer vision (ICCV)
Google Scholar
Varma M, Zisserman A (2005) A statistical approach to texture classification from single images. Int J Comput Vis 62(1–2)
Google Scholar
Verbeek J, Triggs B (2007) Region classification with Markov field aspect models. In: Proc IEEE conf computer vision and pattern recognition (CVPR)
Google Scholar
Winder S, Brown M (2007) Learning local image descriptors. In: Proc IEEE conf computer vision and pattern recognition (CVPR)
Google Scholar
Winn J, Criminisi A, Minka T (2005) Categorization by learned universal visual dictionary. In: Proc IEEE intl conf on computer vision (ICCV), Beijing, China, October 2005, vol 2
Google Scholar
Zhang J, Marszałek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: a comprehensive study. Int J Comput Vis 73(2)
Google Scholar

Download references

Acknowledgements

We would like to thank J. Winn, B. Wenger, O. Yamaguchi, and V. Viitaniemi for helpful conversations and insights contributing to the work in this paper.

Author information

Authors and Affiliations

Unicorn Media, Temple, USA
M. Johnson
Microsoft Research, Cambridge, UK
J. Shotton
University of Cambridge, Cambridge, UK
R. Cipolla

Authors

M. Johnson
View author publications
You can also search for this author in PubMed Google Scholar
J. Shotton
View author publications
You can also search for this author in PubMed Google Scholar
R. Cipolla
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Microsoft Research Ltd., 7 J.J. Thomson Avenue, Cambridge, CB3 0FB, United Kingdom
A. Criminisi
Microsoft Research Ltd., 7 J.J. Thomson Avenue, Cambridge, CB3 0FB, United Kingdom
J. Shotton

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Johnson, M., Shotton, J., Cipolla, R. (2013). Semantic Texton Forests for Image Categorization and Segmentation. In: Criminisi, A., Shotton, J. (eds) Decision Forests for Computer Vision and Medical Image Analysis. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-4929-3_15

Download citation

DOI: https://doi.org/10.1007/978-1-4471-4929-3_15
Publisher Name: Springer, London
Print ISBN: 978-1-4471-4928-6
Online ISBN: 978-1-4471-4929-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Semantic Texton Forests for Image Categorization and Segmentation

Abstract

Similar content being viewed by others

A Powerful and Efficient Method of Image Segmentation Based on Random Forest Algorithm

Semantic Segmentation with Millions of Features: Integrating Multiple Cues in a Combined Random Forest Approach

Fast and Scalable Image Retrieval Using Predictive Clustering Trees

Keywords