1 Introduction

Almost all imaging applications and photo-management systems require that images are correctly oriented before processing and visualization. For example, most applications for detection and scene classification heavily rely on the assumption that the given images are upright.

The correct orientation of an image is defined as the orientation in which the scene originally occurred [21, 23]. When no correction is applied, the orientation of a photograph is determined by the rotation of the camera at the moment the picture was taken. Even though any angle is possible, rotations by multiples of 90° are the most common. They are also straightforward to correct once detected. Therefore, it is common to assume that the images have been taken in one of the four orientations 0°, 90°, 180°, 270° (sometimes called ‘North’, ‘West’, ‘South’ and ‘East’).

Information about the orientation of a photograph may be obtained from sensors incorporated into the camera and recorded in the EXIF [8] metadata tags. However, this information is often missing on low-end digital cameras, or may have been removed by photo-editing software. In these cases the user’s intervention is required.

Humans can identify the correct orientation of photographs by exploiting their image-understanding capabilities. An extensive study on the psychophysical aspects of image orientation recognition was presented in [15]. Using a panel of 26 observers who evaluated 1,000 images, the authors gained a number of interesting insights. They observed that for typical images, accuracy is close to 98 % when all available semantic cues from high-resolution images are used, and 84 % when only low-level vision features and coarse semantics from thumbnails are used. Some semantic cues stood out as being very important for correct orientation recognition (e.g. sky and people). The same study also shows that a resolution of 256 × 384 pixels is enough for humans to achieve high accuracy.

The manual correction of image orientation is a tedious, time-consuming and error-prone activity. This is particularly true when large collections of photographs have to be processed. For these cases (digital archives, websites, content-based retrieval systems, workflow management for professional photographers...) an automatic approach would be helpful. Devising a computational approach for automatic detection of image orientation that mimics high-level human understanding capabilities is a challenging task. Several semantic cue detectors would be required to cope with the great variability of image content. Therefore, this approach tends to be computationally expensive. Moreover, its accuracy would greatly depend on the capability of bridging the semantic gap between high-level cues and low-level features [7].

In this work we show that it is possible to devise an image orientation detection algorithm, based purely on low-level features, whose performance is comparable with that of human observers. The features are derived from Local Binary Patterns (LBP) [17], and are efficiently processed by a linear classifier obtained by logistic regression.

We have used a subset of the SUN image database [24] to test our proposal. This set contains 108,754 images divided into 397 scene categories. The experiments assessed the performance of our orientation detection algorithm with respect to specific scene types, also taking into account the influence of color, image resolution, and the size of the training set. Our algorithm outperforms similar approaches in the state of the art, and shows an accuracy comparable with that reported by Luo et al. [15] for human observers.

1.1 Related work

Some orientation detection methods in the state of the art rely on low-level features to represent those cues that can be analyzed by a classifier to predict the most probable orientation. For instance, Vailaya et al. [21] used color moments, color histograms, edge direction histograms, and MSAR texture features to describe the images after their subdivision into 10 × 10 blocks. They used a learning vector quantizer to extract a small codebook, which they used to estimate the class-conditional densities of the observed features needed for their Bayesian methodology. They reported 97 % classification accuracy, obtained on a subset of high-quality images from the Corel photo collection.

Wang and Zhang [23] exploited both chrominance and luminance information. Color moments are computed over the 48 peripheral sub-blocks of an 8 × 8 block subdivision of the image, while an edge direction histogram is used to characterize the image structure and texture. This information is then processed by different SVM (Support Vector Machine) classifiers. Static classifier combination and hierarchical trainable classifier combination approaches are investigated. They reported an accuracy of 78 % on another subset of the Corel images.

Lyu et al. [16] proposed a method based on a set of natural image statistics collected from a multi-scale, multi-orientation image decomposition. A two-stage hierarchical classification with binary SVM classifiers is employed to determine image orientation. Experiments performed on 18,040 natural images of different sources and contents showed that the proposed method achieved about 60 % accuracy.

Lumini and Nanni [13] used color moments, Harris corners, phase symmetry, and edge direction histograms to describe the images. They then used Borda count to combine different classifiers based on Support Vector Machines, Parzen windows, and statistical classifiers. They obtained 62 % accuracy on 6,000 images scanned from 350 rolls of film.

Baluja [3] used hundreds of classifiers trained with AdaBoost to determine the upright orientation of an image. 3,930 features related to color and edge information are extracted from image subregions. Weak binary classifiers are used, each built to compare a pair of features. The best set of 1,000 weak classifiers is then selected using the AdaBoost algorithm and combined to obtain a strong classifier. He reported results obtained on several data sets; the accuracy on the largest one (Corel Disk-6, 15,888 images) is 61.9 %. A combination of 180 different strong classifiers is also investigated, and the accuracy on the same data set increased to 65 %. If a rejection rule is introduced, the accuracy on the Corel Disk-6 data set increases to 80.3 %.

Tolstaya’s [20] approach is based on the assumption that the lower part of an image has more texture than the other regions. Features are computed on local regions of the image and comprise luminance, chrominance and texture information. A two-stage classification approach based on AdaBoost is used to detect the image orientation. A rejection scheme is also introduced. At the lowest rejection rate, the accuracy obtained is 87 % on a data set of 861 outdoor images.

A method explicitly designed to require low computational resources is the one proposed by Appia et al. [2]. Their algorithm is based on simple gradient and intensity features extracted from peripheral image sub-blocks. The orientation is determined by a set of heuristic rules, and a rejection threshold is used to discard ambiguous results. A test on 200 consumer images showed an accuracy of 74 % without the rejection threshold and 86 % with it.

Human observers are clearly more accurate in detecting the correct image orientation when they are allowed to take into account high-level semantic cues [15]. For this reason, some works exploit the information obtained by recognizing distinguishable elements in the image such as faces, sky, grass, etc. For instance, Lei Wang [22] used both low-level and high-level features: orientation of faces, position of the sky, brighter regions, textured objects, and symmetry. The cues are combined in a Bayesian framework, obtaining an accuracy of 94 % on a data set of 1,287 images.

Luo and Boutell [14] developed a probabilistic approach to image orientation detection via confidence-based integration of low-level and semantic cues within a Bayesian framework. Semantic information is provided by suitable detectors for faces, blue and cloudy sky, grass, and ceilings/walls. They reported 90 % accuracy on a set of 3,652 unconstrained consumer photos.

Ciocca et al. [5] combined low-level features and faces. The approach uses the detection of faces as a hint to deem the image upright. When the image does not contain any detectable face, the orientation is determined by an image classifier based on three low-level features: the edge direction histogram, the first two moments in the YCbCr color space, and a vertical coherence vector. Classification is performed with the AdaBoost algorithm on a set of weak binary classifiers. Using a-priori orientation probabilities, on the largest data set, composed of about 4,000 images downloaded from the Web, the overall accuracy obtained is 86 %.

Borawski et al. [4] use the region of the sky to determine the orientation of outdoor images. The rationale is that the sky visible within an image is different for landscape- and portrait-oriented images. The localization of the sky within an image is based on color. Fourier analysis is carried out to determine the orientation of the texture in the sky region. The method has been evaluated on 100 digital images containing the sky: 14 images were rejected and six were misclassified.

As can be seen, the methods proposed in the literature show a wide range of accuracy values. Certainly one reason for this is the heterogeneity of the data sets used to evaluate the methods. Some of these data sets are small or specific to certain image categories, which biases the overall results. For some categories, such as landscapes, the correct orientation can be easily detected. On the other hand, indoor scenes, close-ups, and images with cluttered background are more difficult to classify since they lack important visual cues. For instance, Zhang et al. [25] separately tested their orientation detector on indoor and outdoor images. The accuracy they obtained on indoor images is much lower than that on outdoor images (48 % vs. 85 %). For this reason they introduced an indoor/outdoor classifier to refine the orientation detection, obtaining an accuracy of 81 %.

Table 1 summarizes the aforementioned orientation detection methods.

Table 1 Summary of the orientation detection methods in the state of the art

2 Proposed algorithm

The method we propose is based solely on the information provided by low-level features, that is, features that can be reliably extracted from the images without any a-priori knowledge about their content. By not using high-level features, not only do we keep the complexity of the algorithm manageable, but we also avoid the inherent sensitivity to imaging conditions due to the semantic gap between the features and the image semantics. In other words, we hypothesize that full image understanding is not required for a reliable detection of the image orientation, and that the information provided by low-level features, when processed by a suitable classifier, is enough to obtain a good accuracy for a great variety of image contents.

In the literature, most of the methods based on low-level features focus on color and edge/texture information. Intuitively, color distribution is a very useful cue. However, there are several image categories (e.g. indoor images) where it does not help very much. Therefore, we decided to concentrate on a texture descriptor. More in detail, we decided to use features based on the distribution of Local Binary Patterns (LBP). These feature vectors lie in a high-dimensional space, for which linear classifiers are a very common choice. In this work we build a linear classifier by using regularized logistic regression.

Figure 1 depicts a schematic view of the proposed method which, for the sake of brevity, in the following we will refer to as LBP-LRR (from Local Binary Patterns and Linear Logistic Regression).

Fig. 1
figure 1

The proposed LBP-LRR method for the detection of image orientation

2.1 Image features

Local Binary Patterns have shown remarkable discriminative power in different domains, due to their invariance with respect to lighting conditions and robustness with respect to image noise. For example, LBPs have been used in face recognition [1], multi-object tracking [19], and scene classification [11]. For a comprehensive overview of LBP, readers can refer to [18].

The LBP descriptor is defined as a histogram of the local patterns surrounding each pixel. These patterns are computed by thresholding the intensity of the neighbors of each pixel with the intensity of the pixel itself (see Fig. 2). More in detail, given a neighborhood size P and a radius R, for each pixel the numerical code LBP_{P,R} is computed as follows:

$$ LBP_{P,R} = \sum\limits_{p = 0}^{P - 1} s(g_{p} - g_{c}) 2^{p}, $$
(1)

where g_c is the gray level of the current pixel, g_0,…,g_{P−1} are the gray levels of its neighbors, and s is defined as s(x) = 1 if x ≥ 0, s(x) = 0 otherwise. The P neighbors lie on a circular neighborhood of radius R around the current pixel: the gray value g_p is obtained by interpolating the intensity image at a displacement (R cos(2πp/P), R sin(2πp/P)).
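
As a concrete illustration, a minimal Python sketch of (1) could look as follows (all function and variable names are ours, not part of any reference implementation; bilinear interpolation is done with SciPy's map_coordinates):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def lbp_codes(gray, P=16, R=2.0):
    """LBP_{P,R} code of (1) for every pixel of a gray-level image."""
    gray = np.asarray(gray, dtype=np.float64)
    rows, cols = np.indices(gray.shape)
    codes = np.zeros(gray.shape, dtype=np.int64)
    for p in range(P):
        # Neighbor p lies at displacement (R cos(2*pi*p/P), R sin(2*pi*p/P)).
        dy = R * np.sin(2 * np.pi * p / P)
        dx = R * np.cos(2 * np.pi * p / P)
        # Bilinear interpolation of the neighbor's gray value g_p.
        g_p = map_coordinates(gray, [rows + dy, cols + dx], order=1, mode='nearest')
        # s(g_p - g_c) contributes 2^p when the neighbor is at least as bright as the center.
        codes += (g_p >= gray).astype(np.int64) << p
    return codes
```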

Fig. 2
figure 2

The first steps of the Local Binary Pattern extraction. For each pixel, a circular neighborhood is considered. Each neighbor is thresholded by the intensity of the central pixel determining a binary response. The pattern is formed by concatenating the resulting bits

With P neighbors there are 2^P possible patterns, but not all of them are equally significant. Usually, only patterns describing a somewhat regular neighborhood are considered. These patterns are called “uniform” and are defined as those patterns with at most two transitions (bitwise 0/1 changes) between adjacent bits in the code. For instance, the pattern ‘00011100’ is uniform, while the pattern ‘11001000’ is not because it includes more than two transitions. The number of uniform patterns is 2 + P(P − 1). In fact, the uniform patterns are those consisting of k zeros and P − k ones, where all the zeros (and hence all the ones) are consecutive. There is one pattern for k = 0 and one for k = P. For each value of k in the range {1,…,P − 1} there are P patterns, each corresponding to a different rotation of the bits (see [18] for more details).
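
Uniformity can be tested by counting the bit transitions along the circular code; the helper below is our own sketch (transitions are counted circularly, consistently with the 2 + P(P − 1) count above).

```python
def is_uniform(code, P=16):
    """True if the P-bit LBP code has at most two circular 0/1 transitions."""
    bits = [(code >> p) & 1 for p in range(P)]
    transitions = sum(bits[p] != bits[(p + 1) % P] for p in range(P))
    return transitions <= 2
```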

The circular shape of the neighborhood makes rotation invariance easy to achieve. However, we decided to not exploit this property of the LBP approach because rotation invariance would obviously discard important information about the orientation of the image.

To form a fixed-length feature vector, the patterns are aggregated into one or more histograms. Histograms are formed by counting the occurrences of each uniform pattern in a given region of the image. Non-uniform patterns are not ignored, but they are all accounted for in a single bin. The final descriptor is the concatenation of the normalized histograms. With H possibly overlapping regions and P neighbors, the final descriptor length is H × (3 + P(P − 1)). In fact, each of the H histograms has 2 + P(P − 1) bins for the uniform patterns and one bin for all the non-uniform ones.
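
Putting the pieces together, a possible sketch of the descriptor computation (reusing the hypothetical lbp_codes and is_uniform helpers above) is the following.

```python
import numpy as np

def lbp_descriptor(gray, regions, P=16, R=2.0):
    """Concatenation of normalized per-region histograms of uniform LBP codes,
    with one extra bin per region collecting all the non-uniform codes."""
    codes = lbp_codes(gray, P, R)
    # Map each of the 2^P codes to a histogram bin: one bin per uniform pattern,
    # plus a single shared bin (the last one) for every non-uniform pattern.
    uniform = [c for c in range(1 << P) if is_uniform(c, P)]
    bin_of = np.full(1 << P, len(uniform), dtype=np.int64)
    bin_of[uniform] = np.arange(len(uniform))
    n_bins = len(uniform) + 1                 # (2 + P(P-1)) + 1 bins per region
    histograms = []
    for (r0, r1, c0, c1) in regions:          # regions given as (row0, row1, col0, col1)
        hist = np.bincount(bin_of[codes[r0:r1, c0:c1]].ravel(), minlength=n_bins)
        histograms.append(hist / max(hist.sum(), 1))   # normalize each histogram
    return np.concatenate(histograms)         # length: H * (3 + P*(P-1))
```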

2.2 Orientation recognition

Due to their capability of dealing with high-dimensional feature spaces, linear classifiers have become one of the most popular methods for image classification [6]. In fact, linear classifiers are very fast, and very efficient learning methods exist for their training.

Typically, the learning procedure of a binary linear classifier consists in solving the following optimization problem:

$$ \min\limits_{\mathbf{w}}\frac{1}{2} \| {\mathbf{w}} \|^{2} + C \sum\limits_{i=1}^{m}\xi(\mathbf{w};{\mathbf{x}}_{i},y_{i}), $$
(2)

where x_i denote the training samples (i ∈ {1,…,m}) and y_i ∈ {−1, +1} are the corresponding class labels. The optimal w defines a hyperplane that linearly separates positive from negative instances. The loss function ξ penalizes the errors on the training set, weighted by the penalization coefficient C. In practice, the parameter C determines a trade-off between the penalization and the regularization term ∥w∥² (the ℓ1 norm can be used instead of the Euclidean norm as well). Linear Support Vector Machines are an example of linear classifiers within this framework.

We used the very fast implementation of a regularized binary logistic regression classifier provided by the LIBLINEAR package [9]. The loss function is defined as

$$ \xi(\mathbf{w};\mathbf{x}_{i},y_{i})=\log\left(1+e^{-y_{i}\mathbf{w}^{T}\mathbf{x}_{i}}\right), $$
(3)

which is derived from a probabilistic model.

The optimization problem (2) is solved by the LIBLINEAR library using a trust region Newton method [12]. The problem of orientation detection is not binary, since there are four possible orientations. For multi-class problems LIBLINEAR uses the one-against-all strategy: for each class a binary problem is built to discriminate the instances of that class from the instances of all the other classes. Therefore, the classifier consists of the hyperplanes w_1, w_2,…, w_K, one for each of the K classes. Given a new instance x, the predicted label y ∈ {1,…,K} is obtained as:

$$ y = \arg\max\limits_{j} \mathbf{w}_{j}^{T} \mathbf{x}. $$
(4)

In the case of orientation detection we have K = 4 classes, corresponding to rotations by multiples of 90°.
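
As a sketch of how this training and prediction step might be reproduced (the scikit-learn wrapper around LIBLINEAR is our choice here, not something prescribed by the paper; X, y and X_test denote descriptor matrices and labels assumed to be already available):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one LBP descriptor per row; y: orientation labels (e.g. 0=North, 1=East, 2=West).
clf = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')  # one-vs-rest with liblinear
clf.fit(X, y)

# Equation (4): pick the class whose hyperplane gives the largest score w_j^T x.
scores = X_test @ clf.coef_.T + clf.intercept_
y_pred = clf.classes_[np.argmax(scores, axis=1)]   # same result as clf.predict(X_test)
```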

2.3 Computational complexity

The LBP-LRR algorithm is very fast. When LBP histograms are computed on H disjoint regions, their computation is linear with respect to the number N of pixels in the image and to the cardinality P of the neighborhood. As stated before, the dimensionality of the resulting feature vector is H × (3 + P(P − 1)), and classification is linear with respect to the dimensionality of the feature space. Therefore, the procedure has a time complexity of O(N × P + H × P²). Note that the classification of the patterns as uniform or non-uniform can be obtained very quickly by using a precomputed look-up table with 2^P entries.

2.4 Feature selection and tuning of the parameters

The computation of LBP features depends on several parameters: the neighborhood cardinality (P) and radius (R), and whether or not only uniform patterns are used. Moreover, in order to introduce some locality into the final descriptor, histograms of LBPs are usually computed on different regions of the image, and this subdivision needs to be specified as well. These parameters, and those of the classifier (e.g. the penalization coefficient in (2)), have been tuned by estimating the classification accuracy with a five-fold cross-validation on the training set.

Different combinations of the parameters have been considered; the best one consisted in using uniform LBPs with a neighborhood of cardinality P = 16 and radius R = 2. The best image subdivision turned out to be the union of two partitions, one that uniformly divides the image into six horizontal bands and one that divides it into six vertical bands. In total, therefore, 12 histograms are computed and concatenated to form the final descriptor. During parameter selection we observed a good degree of stability with respect to the penalization coefficient: the best result has been obtained for C = 1.
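
For instance (a sketch only; the grid of C values and the variable names are ours), the cross-validated selection of C for a fixed feature configuration could be done as follows.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# X_train: descriptors computed for a given (P, R, subdivision); y_train: orientation labels.
# The LBP parameters themselves would be explored by repeating this search per configuration.
search = GridSearchCV(
    LogisticRegression(penalty='l2', solver='liblinear'),
    param_grid={'C': [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_)   # the paper reports that C = 1 gave the best, and a stable, result
```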

One of the possible weaknesses of Local Binary Patterns is that, in their original form, they do not encode any information about the color distribution. While it is clear that for most images gray-level information is enough to unambiguously determine their orientation, color is recognized as an important cue. In fact, most algorithms in the state of the art heavily rely on the information provided by the color distribution [5, 13, 14, 20, 21, 23, 25].

To assess the importance of the color information we tried to complement the LBP histograms with various color features (color moments in different color spaces and various kinds of color histograms). The best results have been obtained by using color moments (mean and standard deviation) in the YUV color space, with the same image subdivision used for the LBP histograms. An alternative way to include color information is to compute the LBP histograms independently on the components of a color space. We implemented this strategy by considering LBPs on the three RGB components (in the following we will refer to this algorithm as LBP-RGB).
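
A possible sketch of this color feature (the band subdivision follows the configuration described above; the RGB→YUV conversion uses the standard BT.601 coefficients, and all names are ours):

```python
import numpy as np

def band_regions(h, w, n=6):
    """Six horizontal plus six vertical bands, as (row0, row1, col0, col1) tuples."""
    horizontal = [(i * h // n, (i + 1) * h // n, 0, w) for i in range(n)]
    vertical = [(0, h, j * w // n, (j + 1) * w // n) for j in range(n)]
    return horizontal + vertical

def yuv_moments(rgb):
    """Mean and standard deviation of Y, U, V over each of the 12 bands."""
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b        # BT.601 luma
    u = 0.492 * (b - y)                          # blue-difference chroma
    v = 0.877 * (r - y)                          # red-difference chroma
    features = []
    for (r0, r1, c0, c1) in band_regions(*rgb.shape[:2]):
        for channel in (y, u, v):
            block = channel[r0:r1, c0:c1]
            features.extend([block.mean(), block.std()])
    return np.array(features)                    # 12 regions x 3 channels x 2 moments
```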

3 Experimental results

Most existing orientation detection algorithms have been evaluated on small and homogeneous data sets (e.g. only outdoor images, all images with visible sky, etc.). An algorithm designed for real applications should be proved to be effective on a large, heterogeneous collection of images. To this end we have chosen the SUN image database [24] for our experiments. The database was collected by selecting, from the available terms of WordNet [10], those describing concrete scenes, places, and environments. After the removal of synonyms the final set of terms comprised 899 image categories. For each term, images were retrieved from the Web by using different search engines, obtaining a total of 130,519 images. As suggested in [24], we considered only those categories containing at least 100 images. The final image data set is thus composed of 108,754 images belonging to 397 categories. Figure 3 shows some representative images taken from different categories in this data set.

Fig. 3
figure 3

Some image categories from the SUN database

We divided the data set into a training and a test set. Starting from the 108,754 images of the SUN database, we randomly selected 2,500 images (about 2.3 % of the whole data set, see Section 3.4 for further considerations about the size of the training set) to form the training set. The remaining 106,254 form the test set, and are used to evaluate the methods. All the 397 categories are represented in the test set.

The orientation of the images has already been corrected by the authors of the SUN database, and the images may be in the “landscape” layout (i.e. wider than tall) or in the “portrait” layout. We altered the database to simulate the situation in which the images are taken with a digital camera that does not feature an automatic orientation capability. Images with a “landscape” layout retain their original orientation (i.e. the North direction). Portrait images are randomly rotated clockwise or counter-clockwise by 90°, and labeled with the East and West orientations, respectively. No image has been labeled with the South orientation, because this would correspond to a picture taken with the camera turned upside down (an unrealistic case). Following this procedure all the images end up having a landscape layout. Of the 2,500 images in the training set, 1,841 have been labeled with the North orientation (73.6 %), while the East and West labels have been assigned to 340 (13.6 %) and 319 (12.8 %) images, respectively. Concerning the 106,254 images in the test set, 77,265 have been labeled as North (72.7 %), 14,621 as East (13.8 %), and 14,368 as West (13.5 %). These figures agree with the distribution reported by other authors. For instance, for consumer photos scanned from film, Luo and Boutell [14] reported 72 % North, 14 % East, 12 % West, and 2 % South (which, although uncommon, is possible in the case of scanned film).
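
A short sketch of this re-labeling step (the mapping of the clockwise/counter-clockwise rotation to the East/West labels follows the text; the helper name and the use of NumPy are ours):

```python
import random
import numpy as np

def relabel(image, rng=random):
    """Keep landscape images as North; rotate portrait images by 90 degrees
    clockwise (East) or counter-clockwise (West), chosen at random."""
    h, w = image.shape[:2]
    if w >= h:                                  # landscape layout: orientation unchanged
        return image, 'North'
    if rng.random() < 0.5:
        return np.rot90(image, k=-1), 'East'    # 90 deg clockwise
    return np.rot90(image, k=1), 'West'         # 90 deg counter-clockwise
```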

Some other works in the state of the art prefer to generate a balanced data set, where each orientation is equally represented, by randomly rotating each image. We preferred to keep the data set unbalanced because: (i) it better represents the conditions found in real applications; (ii) it keeps the correlation between the content and the layout of the images (after all, the “portrait” and “landscape” layouts are named this way because they are typically used for those kinds of scenes).

3.1 Results

In the first experiment we compared variants of the proposed LBP-LRR method based on different features: LBP histograms, YUV moments, their combination, and LBP histograms on the RGB components. The results are reported in Table 2.

Table 2 Classification accuracy obtained on the test set by the variants of the LBP-LRR method

With the combination of LBP histograms and color moments, the orientation of 98,200 of the 106,254 images that form the test set has been correctly identified (92.4 %). Slightly worse results have been obtained without color information (91.0 %). This demonstrates that color moments, while not very useful when used alone (83.4 % classification accuracy), can complement the information encoded by the LBP histograms, resulting in a measurable improvement (even if to a limited extent). The use of LBP histograms on the RGB components, instead, did not bring any significant improvement (only 0.1 % better than the original LBP histograms).

The SUN database has the advantage of being carefully organized in several semantic categories, making it possible to analyze in detail the behavior of the algorithms when dealing with different image contents. Figures 4 and 5 report the results obtained on the 397 categories by using LBP histograms combined with the color moments (in the following we will implicitly refer to this feature combination when not stated otherwise).

Fig. 4
figure 4

Detail of the classification accuracy obtained on the 397 categories of the SUN database by the LBP-LRR algorithm (top 200). The categories are listed by decreasing accuracy, and are depicted according to their macro category (indoor, outdoor man-made, outdoor natural). Some categories belong to more than one macro category and are indicated by a hatched bar

Fig. 5
figure 5

Detail of the classification accuracy obtained on the 397 categories of the SUN database by the LBP-LRR algorithm (bottom 197). The categories are listed by decreasing accuracy, and are depicted according to their macro category (indoor, outdoor man-made, outdoor natural). Some categories belong to more than one macro category and are indicated by a hatched bar

For 17 categories (from ‘athletic field’ to ‘volleyball court’ in the figure) the orientation of all the images has been correctly detected. This is quite remarkable because these categories are quite heterogeneous, featuring a high degree of intra-class variability. For another 17 categories (including, for instance, ‘cafeteria’, ‘dam’, ‘butte’, ‘planetarium’) only one test image has been misclassified. At the other end of the spectrum we have the categories for which the accuracy of the classifier is very low: ‘doorway’, ‘pulpit’, and ‘apse/indoor’ obtained a classification accuracy of less than 60.0 %. These categories contain many images that are cluttered or underexposed (see the first row in Fig. 3).

Of the best 30 categories, 27 are outdoor, and 18 of the worst 30 are indoor. This seems to confirm previous results in the literature [25], where it has been shown that the orientation of indoor images is harder to detect than that of outdoor images. However, if we look at our results on all the 397 categories, we see that the differences between indoor and outdoor are not very evident.

The SUN database is also hierarchically organized: at the first level there are three macro categories, namely indoor, outdoor man-made, and outdoor natural. These are then further divided into several sub-categories. Table 3 reports the results with respect to this categorization. Differently from other studies in the state of the art [25], the performance is quite stable across the indoor/outdoor macro categories. The highest accuracy has been obtained on the ‘outdoor man-made’ category (93.5 %). On indoor images the accuracy was 90.9 %, which is slightly worse than the 92.7 % obtained on ‘outdoor natural’ images.

Table 3 Classification accuracy on the first two levels of categorization. Note that some images belong to multiple categories and that, to simplify the analysis, they have been ignored here

Even within each macro category, the accuracy on the sub-categories is quite regular. Only in two cases does the accuracy fall below 90 %: ‘cultural’ (87.5 %) and ‘shops, cities, towns’ (88.3 %). Nevertheless, the difference between the hardest and the easiest sub-categories is more than 9 %. This suggests that results obtained on small data sets, which hardly cover all the sub-categories, are prone to be biased. For large data sets the simple subdivision into indoor/outdoor man-made/outdoor natural (or, even worse, into indoor/outdoor) is too coarse to fully understand the results.

Figure 6 shows some examples of the errors made by the algorithm. Errors typically occur when the images contain a large amount of detail that makes it very difficult to identify, using only low-level features, those patterns which are clear indicators of the correct orientation. In most cases the correct orientation is difficult to determine “at a glance” even for human observers; on the contrary, there are images where our high-level understanding makes it evident and unambiguous.

Fig. 6
figure 6

A random sample of some of the errors made on the test set. Images are rotated according to the orientation detected

3.2 Comparison with other methods

We measured the performance on the SUN database of a selection of alternative methods from the state of the art. In particular, we focused on those methods whose training procedure can be faithfully replicated without additional information or data. This criterion excludes the methods relying on high-level features provided by specific object detectors, which require additional data for training. The methods we considered are those proposed by Vailaya et al. [21], Tolstaya [20], Ciocca et al. [5], and Appia and Narasimha [2].

The comparison is based on our own implementations of these methods (see Section 1.1 for a brief description). We used the same experimental protocol described before: training on 2,500 images (possibly with a five-fold cross-validation for the model selection step) and test on the 106,254 images of the test set. The classification accuracies obtained are reported in Table 4.

Table 4 Classification accuracy obtained on the test sets by the LBP-LRR method, and by four algorithms from the state of the art

The results are quite clear: the LBP-LRR method outperforms the other methods considered. Note that the performance reported by the original authors may be quite different. For instance, Appia and Narasimha [2] reported a higher accuracy (74 %) than that shown here. This can be explained by the fact that, in order to achieve a very high processing speed, they based their method on reasonable but simple assumptions (i.e. that high-intensity regions lie at the top and high-frequency regions at the bottom of the image). These assumptions usually hold for prototypical images (landscapes, indoor images with little or no clutter), but not for most of the images in the SUN database. Another example is the method of Vailaya et al. [21], for which the authors reported an accuracy of 98 % on a set of 8,364 Corel images mostly depicting uncluttered scenes with a clear subject, as taken by professional photographers. With this method Luo and Boutell [14] obtained 78 % on a collection of 3,652 personal photographs. This difference depends on the properties of the data set used for the evaluation. In our experiments we used a much larger test set (more than 100,000 images) of varying quality and resolution, obtaining for the Vailaya et al. method a classification accuracy of 80.1 %.

More in detail, Table 5 reports the confusion matrices obtained by the five methods considered. All but one of the methods bias their decisions towards the ‘North’ orientation. This behavior has been learned from the training set, without any explicit indication. Similarly, the ‘South’ orientation has been virtually ignored. The method by Appia and Narasimha is an exception, since it is based on rules without any training procedure. The low performance we obtained with their method is also explained by its inability to exploit uneven prior distributions. The design of the method by Ciocca et al. does not completely rule out the ‘South’ orientation even if no training image has that orientation. However, the ‘South’ orientation is predicted in less than 1 % of the cases.

Table 5 Confusion matrices of the methods compared in experiments. Results are expressed in percentage. The diagonal elements (corresponding to correct classifications) are reported in bold

The effectiveness of the proposed approach is mostly due, in our opinion, to the design of the image descriptor. Local Binary Patterns, in fact, are robust against several categories of image transformations. For instance, they are left unchanged by monotonic transformations of the pixel values (see (1)), such as those caused by changes in the lighting conditions. Moreover, the aggregation of the patterns into histograms makes the descriptor robust against small translations and scalings.

On the other hand, the descriptor is sensitive to rotations of the image plane. More in detail, rotations by multiples of 90° result in permutations of the feature vectors: the uniformity of the patterns is not affected, but the directionality of the uniform patterns changes according to the angle of rotation (see Fig. 2); due to the way in which the image is subdivided, the order in which the histograms are concatenated also changes in a predictable way (e.g. the first horizontal band becomes the first vertical after a counter-clockwise rotation of 90°, the last horizontal after a rotation of 180°, and the last vertical after a rotation of 270°).

In our framework, the choice of the classifier is not as important as the design of the descriptor. To verify this, we repeated the experiment using different classifiers: linear and non-linear (Gaussian RBF) Support Vector Machines (SVM), and a nearest neighbor classifier. Their parameters have been selected by five-fold cross-validation in the same way described before for linear logistic regression. Table 6 reports the results obtained: with SVMs the accuracy is just slightly lower than with logistic regression. Clearly worse results have been obtained, instead, with the nearest neighbor classifier (using k-NN with k > 1 did not bring any improvement).

Table 6 Classification accuracy obtained on the test set by different classifiers

3.3 Resolution of the images

Image resolution clearly influences the accuracy of orientation detection. A psychophysical study on this issue was conducted by Luo et al. [15]. They asked 26 subjects to detect the orientation of 1,000 images at five different resolution levels (24 × 36, 64 × 96, 128 × 192, 256 × 384, 512 × 768). They concluded that the performance of human observers can be considered an upper bound for computer vision algorithms. This bound would be 84 % when coarse semantics is used (64 × 96 pixels, in their experiment) and 96 % when all the semantics are considered (512 × 768 pixels). Of course these figures depend on the data set considered.

To verify how much the resolution of the images influences the performance of our algorithm, we measured its performance at the same resolution levels used by Luo et al. Before training and test, images have been resampled so that their longest side is 36, 96, 192, 384, or 768 pixels, according to the resolution level under consideration; the other side is scaled to preserve the aspect ratio. Three variants of the algorithm are considered: LBP histograms combined with color moments, LBPs only, and color moments only. The resulting classification accuracies are reported in Fig. 7. To allow a rough comparison with the performance of human subjects, the results obtained by Luo et al. are also shown, even though they have been obtained on a different data set.

Fig. 7
figure 7

Performance of the LBP-LRR method, varying the resolution of the images. Three variants are considered: LBP histograms combined with color moments, LBPs only, and color moments only. The plot also reports the performance obtained by human subjects, as taken from [15]

The results clearly show that the LBP-LRR method takes advantage of the additional information provided by higher resolution levels. As expected, color moments are virtually invariant with respect to the image resolution. At the lowest resolution level, LBP features perform worse than color moments, but they quickly improve and are clearly better at medium and high resolutions. At the highest level, the performance of LBPs is still increasing, even though the behavior of the plot suggests that it is converging to a maximum. By using a combination of LBPs and color moments, better results are obtained than by using either feature alone.

3.4 Size of the training set

One of the advantages of using a large data set is that it allows us to reliably assess how the performance depends on the size of the training set. To do so, we subdivided the data set into training and test sets of different sizes: the 2,500 images used before are not considered here (but we use the parameters found with the cross-validation on those images). The remaining 106,254 images have been randomly partitioned into training and test set pairs. The cardinalities of the training sets are the powers of two from 32 to 65,536. The test sets are the complements of the training sets.

Figure 8 reports the results obtained with LBP histograms, color moments, and their combination. In all three cases, the classification accuracy increases with the size of the training set. With color moments, no significant improvement is observed beyond 8,192 images. With LBPs, the performance corresponding to the largest training set (65,536 images) is close to 94 %. We believe that an even larger training set would allow the combination of the two features to match the performance obtained by human subjects.

Fig. 8
figure 8

Performance of the LBP-LRR method, as a function of the number of images in the training set (logarithmic scale). Three variants are considered: LBP histograms combined with color moments, LBPs only, and color moments only

4 Conclusions

In this paper we have investigated the automatic detection of image orientation. We have shown that it is possible to devise an effective algorithm based purely on low-level features extracted from gray-level images. More in detail, we have proposed the use of Local Binary Patterns for the description of the image content, and of a linear classifier obtained by regularized logistic regression. With this approach we obtained a remarkable classification accuracy (91.0 %). Only slightly better results (92.4 %) have been obtained by combining the LBP features with color moments. In both configurations the algorithm outperformed all the other detection algorithms considered, and it is close to the human performance reported in the state of the art. Our findings are supported by the use of a large collection of images (more than 100,000) covering a wide range of scene categories. The use of this data set allowed us to obtain reliable and insightful results: the accuracy of the algorithm is quite stable across the categories of the SUN database. About 75 % of the 397 categories have a detection accuracy above 90 %. In particular, unlike most algorithms in the state of the art, the performance on indoor and outdoor images is very similar (about 91 % vs. 93 %).

We also investigated the influence of image resolution on the algorithm’s performance: at low resolution (i.e. a longest side of 36 pixels), color seems to be more important than structure. Notwithstanding this, even without color, the accuracy is about 80 %. Concerning the size of the training set, we observed that even with very few (i.e. 32) training samples we can achieve a detection accuracy of more than 80 %.

The results obtained on the hierarchically organized categories allowed us to identify those types of scenes that are more problematic for our algorithm. These results are very insightful in that they provide directions for further improvements of our algorithm.

On the basis of these results we believe that the algorithm is suitable for application in a variety of scenarios. We will make available the source code of our algorithms and the lists of images used for training and test.