
1 Introduction

The widespread diffusion of smart devices, such as smartphones and tablets, and the emerging trend of wearable devices, such as smart glasses and smart watches, have pushed forward the development of applications in which users can interact relying on their position and field of view. In this way, users can also receive the needed information in augmented reality, that is, see the information through the smart device, overlaid on top of the real scene.

For instance, users can obtain information on monuments, shops, hotels, or restaurants with their smartphones by simply pointing them toward the subject of interest. The relevant information is displayed aligned with the observed real scene.

Some existing augmented reality systems use a GPS and a compass to determine the position and field of view of users in order to align reality and synthetic images correctly. These systems can be used, to some extent, outdoors with large-scale landmarks such as squares and archaeological sites, provided the user is located close to a precise, preplanned position. In fact, these techniques suffer from high imprecision, which is generally not noticed by users as long as they are located in approximately the right position. For instance, it is possible to offer an augmented reality view of the Coliseum, showing how it looked before it was damaged, provided the user stands in the right position. However, these techniques cannot be used to offer augmented reality when users are located in arbitrary positions with respect to the subject of interest, because of the possible misalignment between the real and artificial scenes caused by imprecise localization. The problem is even worse when dealing with objects of smaller dimensions, such as statues or paintings. In those cases, localization with a GPS and a compass cannot be used reliably.

Recently, techniques have been defined to determine precisely the position of the user and of the smart device, leveraging content-based image retrieval (Datta et al. 2005) and object recognition. The point of view of the user with respect to the observed object can be determined precisely by matching the observed scene against a set of prealigned images. By using this information, it is possible to place the virtual scene (images, videos, 3D models, animations, etc.) with the correct perspective and position on top of the real scene. To execute these tasks we need to recognize what the user is looking at, determine the relative position of the user (Amato et al. 2011a), track the user's movement, and finally render the virtual scene with the correct perspective. These operations must be executed in real-time, with fluidity and precision, to provide a good user experience.

The use of visual analysis to determine the position of the user has the advantage that localization can be performed in places where GPS cannot be used, such as indoor areas, and provides a much finer precision than other techniques. This is particularly true in the presence of objects of small size, or when users want specific information on small details of larger objects (for instance, an epigraph on a wall of a building). On the other hand, visual recognition and tracking require that automatic visual recognizers be built for the elements to be identified and tracked.

Even relatively small areas might contain hundreds of details of interest that should be “active” and trigger an augmented reality experience when observed. Consider, for example, a square in a historical city. It might include several buildings, statues, frescoes, and other details that are of interest to cultural tourists and that should be recognized and tracked by the smart devices to provide users with an augmented reality experience. Even if the computing power of smart devices is quickly improving, the resources required to perform real-time recognition and tracking of a large number of subjects of interest are still a challenge. In this respect, techniques have been developed that can deal with large-scale archives of images to be recognized and that can analyze images even on low-resource devices.

During the last few years, local descriptors, for instance, SIFT (Lowe 1999) and SURF (Bay et al. 2006), to cite two, have been widely used to support effective object recognition and visual tracking tasks. A local descriptor is generally a histogram representing statistics of the pixels in the neighborhood of an interest point chosen (automatically) in an image.

Executing object recognition tasks relying on local features is generally resource demanding. Each digital image, both the queries sent by smart devices and the images in the image libraries, is typically described by thousands of local descriptors. To decide that two images match because they contain the same object, the local descriptors of the two images must be compared pairwise to identify matching patterns.

To reduce the cost of extracting, representing, and matching local visual descriptors, researchers have investigated the use of binary local descriptors, for example, BRISK (Leutenegger et al. 2011) or ORB (Rublee et al. 2011). With binary descriptors, each bin of the vector representing the local descriptor contains just 0 or 1. Binary descriptors are much faster to extract, are obviously more compact than nonbinary ones, and can also be matched faster by using the Hamming distance rather than the Euclidean distance. This allows their use on devices with low resources, such as smartphones, or when the response time must be very short even in the presence of large image libraries.
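As a concrete illustration, the following minimal sketch extracts ORB binary descriptors from two images and matches them with the Hamming distance using OpenCV's Python bindings; the image file names are placeholders.

```python
import cv2

# Load the query (from the smart device) and a reference image in grayscale.
query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
reference = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)

# ORB produces binary local descriptors (256 bits each by default).
orb = cv2.ORB_create(nfeatures=500)
kp_q, des_q = orb.detectAndCompute(query, None)
kp_r, des_r = orb.detectAndCompute(reference, None)

# Binary descriptors are compared bit-wise, so the Hamming distance is used
# instead of the Euclidean distance required by SIFT or SURF.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_q, des_r), key=lambda m: m.distance)
print(len(matches), "tentative matches; best distance:", matches[0].distance)
```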

The problem of reducing the cost of image matching on a very large scale has been addressed by defining methods for quantizing and/or aggregating local features. Quantization methods, such as the bag-of-features (BoF) approach (Sivic and Zisserman 2003), define a finite vocabulary of local descriptors, that is, a finite set of local descriptors to be used as representatives. Aggregation methods, such as Fisher vectors (FV) (Perronnin and Dance 2007) or the vector of locally aggregated descriptors (VLAD) (Jégou et al. 2010), instead analyze the local descriptors contained in an image to create statistical summaries that still preserve the effectiveness of local descriptors while allowing them to be treated as global descriptors.

Once an object has been recognized, local descriptors can also be used to compute a homography that maps the image seen by the smart device onto the target image representing the recognized object. The computed homography embeds the information needed to estimate precisely the relative position of the smart device with respect to the object of interest. This allows a 3D rendering engine to overlay the information associated with the object on top of the real scene with a precise alignment, continuously adapting to the movements of the device itself.

This chapter is organized as follows: In Sect. 20.2, we discuss the state-of-the-art technologies for automatic visual localization. Augmented reality goals, issues, and solutions are presented in Sect. 20.3.

2 Visual Localization

In this section, we consider the problem of automatically understanding the user information needs given a photo taken by the user with a mobile phone.

In this context, the computer vision literature focuses on answering three major questions:

  • Where was the picture taken?

  • What is in the photo?

  • Where are the objects in the photo located?

For each question, a specific task has been defined:

Mobile visual localization

aims at identifying the location in which a picture was taken. The visual approach has been proposed to overcome limitations of GPS-based approaches, such as unavailability in indoor environments, latency, and multipath effects (Schroth et al. 2011).

Visual landmark recognition

is the task of identifying a building or monument in photos (Weyand and Leibe 2015). Essentially, it is a specialization of the more general object recognition task, which is central in computer vision.

Visual photo mapping

considers the problem of placing photos on a map in order to organize large datasets (Crandall et al. 2009). In this scenario, a photo of an object (e.g., the Eiffel Tower) taken from a long distance could be placed at the actual location of the object itself in order to improve the user experience when browsing the collection.

Coming back to the objective of this section, that is, understanding the user information needs, we believe that landmark recognition is the primary task. In fact, the most important question we have to answer is, “What is in the photo?” As an example, a tourist on top of the Eiffel Tower can take photos of many different tourist landmarks in Paris. If the visual query represents, for instance, Montmartre, neither the actual position of the tourist nor a precise localization of the landmark is relevant. Thus, even if the localization of both the user and the landmark could be useful, the primary task is identifying the objects in the photo.

The minimum knowledge base for recognizing landmarks consists of a set of photos of each object that should be recognized. When this set is either not available or not sufficient, public photos can be used. Many approaches have been proposed to leverage Internet photo collections such as Flickr or Panoramio. In some cases, even the landmarks themselves are automatically identified in public collections (Weyand and Leibe 2015).

2.1 Object Recognition

Object recognition is essentially a classification task in which the label to be assigned to a photo is the ID of the object. However, for both efficiency and scalability with respect to the number of objects to be recognized, most approaches perform content-based image retrieval (CBIR) on the dataset and apply a k-nearest neighbor classifier to the result set. In other words, a landmark is recognized in a query image if similar images containing the same object are found in the knowledge base.
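The following sketch illustrates this retrieval-plus-classification scheme; the `search` function standing in for the CBIR back end is a hypothetical placeholder that returns, for a query description, the k most similar database images with their landmark labels and similarity scores.

```python
from collections import defaultdict

def recognize_landmark(query_description, search, k=10):
    """k-NN classification over the result set of a CBIR search (sketch)."""
    votes = defaultdict(float)
    for image_id, landmark_label, similarity in search(query_description, k):
        # Similarity-weighted voting: closer neighbors contribute more.
        votes[landmark_label] += similarity
    if not votes:
        return None  # no similar image found: the landmark is unknown
    return max(votes, key=votes.get)
```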

2.1.1 Local Features-Based Approaches

In the last decade, research on object recognition has focused on describing and matching local features, including SIFT (Lowe 1999) and SURF (Bay et al. 2006), extracted from interest points. Both the interest points and the description of the region around them should be robust to illumination variations and affine transformations. Given images described in terms of their local descriptors (thousands per image), comparing any two images amounts to comparing their local descriptors, or aggregated information obtained by considering all the local descriptors in each image. In fact, direct comparison of local features is not only computationally expensive, but it also does not scale. For image registration and pose estimation, matching of local features is mandatory in order to understand the geometric relationships between the images. For higher-level tasks such as retrieval and classification, research has focused on aggregating the information related to the local patterns of the same image.

The bag-of-(visual)-words (BoW) model (Sivic and Zisserman 2003) was proposed in order to improve the scalability of object recognition. It consists in replacing the original local descriptors with the ID of the most similar descriptor in a predefined vocabulary. Following the BoW approach, an image is described as a histogram of the occurrences of the (hundreds of thousands of) visual words of the global vocabulary. The BoW is thus the result of the aggregation of the information contained in the local descriptors. However, as mentioned before, it also quantizes this information, reducing each local description to the ID of a visual word. In Jégou et al. (2008), matching based on visual words was refined in order to improve the effectiveness of the BoW approach by reducing the quantization loss. A survey of BoW approaches for landmark recognition is given in Bhattacharya and Gavrilova (2013).
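A minimal sketch of the BoW pipeline is given below, assuming that the local descriptors are available as NumPy arrays (one row per descriptor); the vocabulary is learned with k-means, and each image is then described by the normalized histogram of its visual-word IDs.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_vocabulary(training_descriptors, n_words=1000):
    # Cluster a pool of training descriptors; the centroids are the visual words.
    centroids, _ = kmeans2(training_descriptors.astype(np.float64),
                           n_words, minit="points")
    return centroids

def bow_histogram(image_descriptors, vocabulary):
    # Quantization: each descriptor is replaced by the ID of its nearest word.
    word_ids, _ = vq(image_descriptors.astype(np.float64), vocabulary)
    hist, _ = np.histogram(word_ids, bins=np.arange(len(vocabulary) + 1))
    return hist / max(hist.sum(), 1)  # L1-normalized occurrence histogram
```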

With the goal of improving both effectiveness and efficiency, two aggregation approaches have more recently been proposed: the vector of locally aggregated descriptors (VLAD) (Jégou et al. 2010) and Fisher vectors (FV) (Perronnin and Dance 2007). The FV characterizes how a sample of descriptors deviates from an average distribution, modeled by a Gaussian mixture model (GMM) estimated on a training set. This model can be seen as a probabilistic visual vocabulary. VLAD, instead, accumulates, for each image, the differences between the visual words of a small vocabulary (hundreds of words) and the descriptors assigned to them. Thus it can be seen both as an extension of the BoW and as a simplified nonprobabilistic version of FV. Typically FVs achieve better results; however, many recent VLAD extensions (e.g., Delhumeau et al. 2013) have obtained very good results on widely used public datasets such as INRIA Holidays and Oxford Buildings.
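The following sketch implements the VLAD aggregation just described for float descriptors and a precomputed vocabulary of centroids; the signed square-root and L2 normalizations follow common practice and are an assumption here.

```python
import numpy as np

def vlad(image_descriptors, vocabulary):
    """Aggregate the local descriptors of one image into a VLAD vector (sketch)."""
    n_words, dim = vocabulary.shape
    # Hard-assign each descriptor to its nearest visual word (centroid).
    distances = np.linalg.norm(
        image_descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    assignments = distances.argmin(axis=1)
    residuals = np.zeros((n_words, dim))
    for word_id in range(n_words):
        assigned = image_descriptors[assignments == word_id]
        if len(assigned):
            # Accumulate the differences between descriptors and their word.
            residuals[word_id] = (assigned - vocabulary[word_id]).sum(axis=0)
    v = residuals.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))  # signed square-root (power) normalization
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```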

Aggregation techniques describe the whole image. However, sometimes the relevant landmark only appears in a small portion of the picture. In Lazebnik et al. (2006), BoW was extended by partitioning the image into increasingly fine subregions and aggregating the features inside each subregion.

A promising extension of the aggregation techniques is embedding geometric information related to the regions of interest from which the local descriptors were extracted (Tolias et al. 2015). By comparing two descriptions obtained by aggregating both the local features and the geometric information, weak geometric matching can be performed, resulting in more robust recognition.

Given the global fixed-length descriptions obtained with these techniques, various access methods can be used, such as Euclidean locality-sensitive hashing (Datar et al. 2004) and surrogate text representation (Amato et al. 2013). A trade-off between quality and efficiency/scalability can also be applied by varying the parameters of these approximate indexing techniques.
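As an example of such an access method, the sketch below implements a basic Euclidean (p-stable) locality-sensitive hash table: descriptors that are close in Euclidean space tend to share a bucket key, so a query is compared only against the candidates in its bucket; the parameter values are illustrative.

```python
import numpy as np

class EuclideanLSH:
    """Single LSH table with p-stable (Gaussian) projections (sketch)."""

    def __init__(self, dim, n_projections=16, bucket_width=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal((n_projections, dim))       # projections
        self.b = rng.uniform(0.0, bucket_width, n_projections)   # offsets
        self.w = bucket_width
        self.buckets = {}

    def _key(self, descriptor):
        return tuple(np.floor((self.a @ descriptor + self.b) / self.w).astype(int))

    def add(self, item_id, descriptor):
        self.buckets.setdefault(self._key(descriptor), []).append(item_id)

    def candidates(self, descriptor):
        # Only these candidates need to be compared exhaustively with the query.
        return self.buckets.get(self._key(descriptor), [])
```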

Two widely used techniques improve the effectiveness of large-scale approaches: query expansion and reranking. Reranking treats the result set obtained with a CBIR system as a new dataset that is compared, typically sequentially, with the query image in order to apply a more effective similarity measure, generally exploiting geometric information (Philbin et al. 2007). In query expansion, a number of highly ranked and geometrically verified images are used to expand the original query with the goal of improving retrieval effectiveness (Chum et al. 2007; Amato et al. 2011b).

Recently, approaches based on the classification of local descriptors have been proposed for both landmarks (Amato et al. 2015) and websites (Akusok et al. 2015). The intuition is that the classification of the whole query image can leverage the classification of each of its descriptors.

Binary local descriptors (e.g., BRIEF, BRISK, and ORB (Heinly et al. 2012)) have also been proposed; they allow more efficient extraction and matching than the traditional SIFT and SURF features. They all share the idea of comparing the luminance of pairs of pixels of the region to be described. The resulting binary descriptions can be compared using the Hamming distance. Their aggregations are typically less effective, and thus they are mainly used on mobile devices when local feature matching is required for pose estimation, image registration, or augmented reality.

2.1.2 Deep Learning

In Razavian et al. (2014), deep learning techniques, and in particular convolutional neural networks, were tested on visual recognition tasks. Basically, the activation values of the neurons of a certain layer of the network are used as a global description of the image. The resulting feature is a fixed-length floating-point vector (typically of 4,096 dimensions) that is used in very much the same way as the aggregations of local features. In fact, these activation features are a sort of aggregated hierarchical feature. Note that local feature aggregations are shallow rather than deep, because they do not create feature hierarchies (i.e., features defined by combining other features). The results show that off-the-shelf neural networks can compete with state-of-the-art local descriptor aggregation techniques in image retrieval tasks. At the moment, local features and their aggregations are still largely used, but we can expect the CBIR part of the overall landmark recognition pipeline to change significantly as a consequence of the adoption of features based on convolutional neural networks. Recently, efficient indexing approaches (Novak and Zezula 2014) have been successfully applied to activation features (Novak et al., to appear).
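As an illustration of how such activation features can be extracted, the sketch below uses a pretrained VGG16 network from PyTorch/torchvision (one possible framework choice, not necessarily the setup used by the cited works) and takes the 4,096-dimensional output of the first fully connected layer as a global image descriptor.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained VGG16; keep the convolutional part plus the first fully connected
# layer of the classifier, whose output is a 4,096-dimensional activation vector.
vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
feature_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(), *list(vgg.classifier[:2]))

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

def activation_feature(image_path):
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feature = feature_extractor(image).squeeze(0)
    return feature / feature.norm()  # L2-normalize for cosine/Euclidean retrieval
```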

3 Augmented Reality

In an early review on the state of the art of augmented reality (Azuma et al. 1997), Azuma defines AR systems as:

Systems that have the following three characteristics:

  • Combines real and virtual

  • Interactive in real time

  • Registered in 3-D.

The previous definition points to applications providing a view of the external (physical) world augmented with additional information. As a consequence, an AR application must:

  • Determine the state of both the physical world and the virtual world.

  • Augment the view of the external world with digital information that needs to be perfectly aligned with the 3D space.

The key to a successful AR app lies mostly in the quality of the alignment: if the alignment is precise and computed in real-time, users get a new experience of the world, where digital information is seamlessly integrated with the real view.

In the context of mobile AR apps for smart cities, the previous considerations led to the so-called “video-see-through” apps: mobile applications that overlay information in real-time on top of the live video stream acquired by the device camera. The first (and most basic) mobile AR apps used only the data coming from two sensors of the device: the GPS receiver, for the location, and the digital compass, for the direction. However, this approach only allows estimating a global position of the device in an outdoor environment. The high-quality cameras and fast CPUs/GPUs equipping current mobile devices, instead, enable the development of AR apps based on computer vision: by “understanding” the current scene facing the camera, apps are able to estimate the relative position of the user with respect to specific points of interest and overlay digital information on the details of the scene. This approach, commonly called natural feature detection and tracking, can be used in the implementation of both indoor and outdoor AR apps.

The rest of this section describes the basics of mobile AR app development based on computer vision algorithms. These apps usually require the following components.

  • A module for scene understanding that estimates the relative position of the device with respect to a known scene. This module uses algorithms and techniques borrowed from the image matching and object recognition fields. More precisely, the module estimates where the user is located with respect to one out of a set of objects, called targets.

  • A module for object tracking. The algorithms for image matching and object recognition, discussed in Sect. 20.2.1, are computationally expensive. When the app is first inspecting the scene looking for a target, the full computational power of the mobile device can be allocated to this task. However, once a target has been detected, the app must continuously update the device location and direction with respect to the detected target, while at the same time aligning and displaying the digital information. Tracking algorithms are able to “follow” an image area using techniques that are simpler and faster than those used for object recognition, freeing up resources for the other tasks.

  • A module for displaying virtual information. There are no constraints on the type of information that can be displayed. It can be as simple as text strings and icons, describing, for example, the details of the scene, or as complex as 3D models, which require specialized hardware and sophisticated algorithms from computer graphics. In order to offer a smooth user experience, sophisticated AR apps try to eliminate the shaking that affects the information displayed in overlay, caused by the very small differences between successive estimates of the user position and direction. This is usually achieved by borrowing methods from signal theory, for example, a Kalman filter.

In the following we describe each component in a specific use case: the recognition of a 2D image (the target) that triggers an augmentation with a 3D model placed on top of the target. The 3D model is a reconstruction of the Roman amphitheatre in Lecce. The target, shown in Fig. 20.2, is an overlay of two images: three quarters of it contain a textured top view of the 3D model, and the upper left quarter contains a symbolic 2D map of the same amphitheatre.

Target Images.

Target images can be either synthetic or natural. Synthetic images (also known as fiducial markers) are usually composed of black and white dots, as shown in Fig. 20.1. They have high contrast, are asymmetric, and can be easily detected and tracked in almost any scene as long as they are fully visible. However, such markers come with some limitations: they are not aesthetically pleasing, especially if printed at larger sizes, and they cannot be used in AR contexts where the aim is to augment the view of large outdoor scenes. Fiducial markers are best used on printed materials, where their presence communicates the availability of additional virtual content. Current research on the recognition of fiducial markers focuses on improving their robustness to occlusions (Bergamasco et al. 2011). Natural images, such as photographs or photorealistic pictures, are much harder to process and recognize. In our context, this means executing an object recognition algorithm on each video frame captured by the camera.

Fig. 20.1 From left to right, first two pictures: examples of synthetic images used as fiducial markers for AR applications. They are very similar to 2D QR codes, shown in the third picture, which can be used as target images as well

Target Placement.

Section 20.2.1 presented some techniques for executing object recognition tasks. As already anticipated, one of the most successful approaches for recognizing targets in the AR domain uses sparse image representations based on local features, that is, visual properties extracted from neighborhoods of salient image points. Given the target image T and the video frame F, in order to establish whether T is in F, both the target and the frame are represented by a set of keypoints and related visual descriptors, also called feature vectors. A visual keypoint is a location in the image, identified by its coordinates (x, y), whose local visual properties are stable with respect to some geometric transformations, such as rotation and scaling. The feature vector associated with a keypoint contains statistics about the color or the intensity values in an image patch surrounding the keypoint location (x, y), encoded as real-valued or binary numbers. There are several algorithms for the extraction of visual keypoints, each offering a different degree of invariance to affine or projective transformations and a different computational cost. In recent years there has been an increasing interest in proposing new keypoints and descriptors that enable fast extraction and matching procedures and that require a small memory footprint (Bay et al. 2006; Leutenegger et al. 2011; Lowe 1999; Rublee et al. 2011). A recent review and comparison can be found in Yang and Cheng (2014).

Figure 20.2 shows the keypoints extracted by the oriented FAST and rotated BRIEF (ORB) algorithm (Rublee et al. 2011) from the target image and a video frame. The ORB features are well suited for mobile apps, mainly because their binary descriptors have low memory requirements and allow fast comparison. Given the two sets of keypoints extracted from the target T and the video frame F (with the related descriptors), the matching procedure computes an initial set of matches M by associating each target descriptor with the closest frame descriptor according to a distance function that depends on the type of descriptor, whether real-valued or binary. However, M contains only pairs of keypoints whose descriptors are similar: the number of matches does not provide any real information about the presence of the target. The left image in Fig. 20.3 shows the matching ORB keypoints between the target and a video frame containing the target. The right image in Fig. 20.3 shows the matches between the target and a random photograph: even if the two images are very different, the number of matching keypoints is quite high. In order to establish whether the target T is present in the frame F, the app needs to verify that the relative positions of the matching keypoints in the two images are compatible with a plausible geometric transformation of the target image. Before performing this geometric verification, it is worth reducing the number of matches by applying one or more heuristics, as discussed, for example, in the seminal paper by Lowe (1999). The images in Fig. 20.3 show only the matches that passed a filtering step using several heuristics; as the images show, there are false matches even after such aggressive filtering.
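The sketch below, based on OpenCV, summarizes these steps for ORB: keypoints and binary descriptors are extracted from the target T and the frame F, matched with the Hamming distance, and filtered with Lowe's ratio test (one of the possible heuristics); the surviving matches still require geometric verification.

```python
import cv2

def tentative_matches(target_gray, frame_gray, n_features=500, ratio=0.75):
    orb = cv2.ORB_create(nfeatures=n_features)
    kp_t, des_t = orb.detectAndCompute(target_gray, None)
    kp_f, des_f = orb.detectAndCompute(frame_gray, None)

    # For every target descriptor, find the two closest frame descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(des_t, des_f, k=2)

    # Lowe's ratio test: keep a match only if the best candidate is clearly
    # better than the second best. False matches can still survive this filter.
    good = [pair[0] for pair in knn
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return kp_t, kp_f, good
```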

Fig. 20.2 From left to right: the print, placed in the physical world, whose recognition triggers the AR visualization; the target image, which is actually a crop of the first image; 500 ORB keypoints drawn on the target image; 500 ORB keypoints drawn on the video frame

Fig. 20.3 Matching keypoints between two images. Each green line connects a keypoint in the target image to a keypoint in the second image. These two images clearly show that the set of matches alone does not provide any information about the presence of the target in a scene. In order to establish whether the target is in the scene and where it is located, we need to perform a geometric verification of the matches

Because the target image is a planar object, the most likely transformation affecting its view in the frame is projective: the app needs to verify whether a homography H mapping T onto F exists (Amato et al. 2011a). It is likely that some of the matches in M are not compatible with a projective transformation whereas others are: the app accepts a homography H if the number of compatible matches (inliers) is high and the number of incompatible matches (outliers) is low. A very common approach for estimating H is the iterative random sample consensus (RANSAC) algorithm (Fischler and Bolles 1981). A homography corresponds to a 3×3 matrix with eight degrees of freedom: in order to compute its parameters we need four point correspondences. First, RANSAC chooses four random matches, assumes that they are inliers, and then uses them to compute the parameters of a homography \(\hat{H}\). This estimate is then applied to all the other matches: the keypoints in the target are projected using \(\hat{H}\) and are counted as inliers only if their projections fall within a small radius of the corresponding keypoints in the video frame. After this estimation has been repeated a fixed number of times, RANSAC returns the estimate \(\hat{H}\) with the largest number of inliers. RANSAC can return a wrong estimate \(\hat{H}\); thus it is worth applying some heuristics to detect and reject estimates that are likely incorrect (Kumar and Jawahar 2006). Commonly used heuristics concern the determinant of the matrix and the size and convexity of the projected region.
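A possible implementation of this geometric verification, using the keypoints and filtered matches produced by the previous sketch, relies on OpenCV's RANSAC-based homography estimation; the inlier threshold and the rejection heuristics (minimum number of inliers, near-degenerate determinant, non-convex projected outline) are illustrative choices.

```python
import cv2
import numpy as np

def verify_target(kp_t, kp_f, matches, target_shape, min_inliers=15):
    if len(matches) < 4:       # a homography needs at least 4 correspondences
        return None
    src = np.float32([kp_t[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_f[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC: repeatedly fit a homography to 4 random matches and keep the
    # estimate with the largest number of inliers (reprojection error < 3 px).
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    if H is None or int(mask.sum()) < min_inliers:
        return None

    # Heuristic rejection of implausible estimates.
    if abs(np.linalg.det(H[:2, :2])) < 1e-3:     # near-degenerate mapping
        return None
    h, w = target_shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    projected = cv2.perspectiveTransform(corners, H)
    if not cv2.isContourConvex(projected):       # target outline must stay convex
        return None
    return H, projected
```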

In order to locate the target image in the video frame more precisely, it is possible to refine the initial estimate \(\hat{H}\) by applying its inverse matrix \(\hat{H}^{-1}\) to the video frame, as shown in Fig. 20.4, and computing a second homography between the projected frame and the target image T. The final homography matrix is the composition of the two homographies and is used to start the tracking of the target image. In fact, this homography embeds all the information needed to determine the relative position of the smart device facing the scene with respect to the recognized object or target.

Fig. 20.4 Left: the white box shows the location of the target in the video frame, according to the homography \(\hat{H}\) estimated using RANSAC. Its localization can be refined by performing a second iteration of the RANSAC algorithm. Right: image obtained by projecting the video frame using the homography \(\hat{H}^{-1}\). It is out of focus because it is part of the original input frame

Image Tracking.

Once the target is localized, the app needs to stop extracting the visual features used for object recognition and to start tracking the region of the frame occupied by the target. Many tracking algorithms rely on the computation of the optical flow, defined as the apparent motion of the objects, surfaces, or edges in the image between two consecutive video frames. When the aim is to track a target image in real-time on mobile devices, the choice is restricted to algorithms with a low computational cost. Many successful algorithms are based on the Lucas-Kanade transform (LKT) (Lucas et al. 1981). Given an initial set of locations (called tracks), LKT establishes how they have moved by iteratively minimizing the squared differences between neighborhoods of the tracks. The success of the tracking also depends on the choice of the points to track: in the example of Fig. 20.5, the tracks have been extracted using the algorithm known as “Good Features to Track” (Shi and Tomasi 1994).
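The sketch below shows how these two building blocks are typically combined with OpenCV: Shi-Tomasi “Good Features to Track” selects the initial tracks inside the region occupied by the target, and the pyramidal Lucas-Kanade implementation follows them from one frame to the next; the parameter values are indicative.

```python
import cv2

def init_tracks(frame_gray, target_mask, max_tracks=200):
    # Shi-Tomasi corners, restricted by the mask to the region of the target.
    return cv2.goodFeaturesToTrack(frame_gray, maxCorners=max_tracks,
                                   qualityLevel=0.01, minDistance=7,
                                   mask=target_mask)

def update_tracks(prev_gray, curr_gray, tracks):
    # Pyramidal Lucas-Kanade optical flow between two consecutive frames.
    new_tracks, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, tracks, None,
        winSize=(21, 21), maxLevel=3,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    kept = status.ravel() == 1
    # When too many tracks are lost, the caller should reinitialize them
    # (the cause of the bounding-box color changes in Fig. 20.5).
    return tracks[kept], new_tracks[kept]
```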

Fig. 20.5 Screenshots of an iPhone during tracking of the target. The sequence starts from the first image of the first row and proceeds left to right. The color of the bounding box around the location of the target image changes when the tracker reinitializes the set of tracks after losing too many of them. The last figure shows the estimated homography after many “zoom-in” and “zoom-out” movements of the device: the homography is not as accurate as the initial one, but still acceptable

Tracking an image for AR purposes is quite difficult: the tracks are usually lost by the tracker after some iterations (especially when the target is only partially contained in the video frame). The screenshots in Fig. 20.5 show an iPhone app tracking the target: the goal of the tracker is to maintain an accurate homography when the user moves the device closer to or farther from the target, while at the same time rotating it. As the images show, the tracker needs to reinitialize the tracks quite often, but it is able to maintain a good approximation of the initial homography, even though the target has been only partially visible several times.

Aligning the Information.

Once we know the location of the target, we can display information in perfect alignment with the content of the scene. When the virtual information consists only of 2D data, the homography is enough to obtain a good alignment. In the case of more complex visualizations, such as 3D models or interactive elements managed by complex window managers as in the iOS platform, we cannot simply use the homography returned by the tracker: we need to estimate the camera pose in a pinhole camera model (Hartley and Zisserman 2003). The procedure for computing the camera matrix used in the model, which projects 3D points onto the 2D camera image, is not particularly complex, but it requires an intrinsics matrix specifying the physical characteristics of the camera acquiring the video stream. The intrinsics can be computed using several camera calibration algorithms, among which the most widely used is that of Zhang (2000).
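The sketch below shows a possible pose-estimation step: given the camera intrinsics (obtained offline with a Zhang-style calibration, e.g., cv2.calibrateCamera) and the four projected corners of the planar target located in the frame, cv2.solvePnP recovers the rotation and translation with which the rendering engine places the 3D model; the physical size of the target is an assumed parameter.

```python
import cv2
import numpy as np

def camera_pose(projected_corners, camera_matrix, dist_coeffs,
                target_width=0.20, target_height=0.20):
    # 3D coordinates of the target corners on the z = 0 plane (metres, assumed),
    # in the same order as the projected corners (top-left, top-right,
    # bottom-right, bottom-left).
    object_points = np.float32([[0, 0, 0],
                                [target_width, 0, 0],
                                [target_width, target_height, 0],
                                [0, target_height, 0]])
    image_points = np.float32(projected_corners).reshape(-1, 2)
    ok, rvec, tvec = cv2.solvePnP(object_points, image_points,
                                  camera_matrix, dist_coeffs)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # 3x3 rotation passed to the rendering engine
    return R, tvec
```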

Figure 20.6 shows the app running on an iPad rendering a very complex 3D model on top of the recognized target image.

Fig. 20.6 Rendering of the 3D model on top of the target. The app is running on an iPad Air

Augmenting with Respect to the 3D Geometry of the World.

The app discussed in this chapter is able to recognize a 2D image and overlay a 3D object (or other information) on top of it. The app can easily be extended to recognize a full scene in the video stream and overlay models on top of it, but it is restricted to projecting information on planar surfaces. More complex AR apps can align the virtual information with respect to the geometry of the external 3D world. This geometry needs to be computed as a point cloud. Point clouds of 3D objects can be built via laser scanning or using a structure-from-motion approach, in which the 3D geometry of an object is inferred from a set of photographs. For example, in Snavely et al. (2006) the authors present Bundler, an algorithm able to build a 3D model of a scene from a large collection of photographs. The alignment in 3D is performed by combining the algorithms and techniques described in Sect. 20.2 with a matching procedure similar to Bundler's.

Frameworks, Software, and Libraries.

Nowadays, several software tools and programming frameworks are available for the implementation of AR apps. A very good open-source multiplatform library is ARToolKit. It offers both synthetic-image and natural-feature tracking, pose estimation, and 3D rendering via OpenSceneGraph or Unity. Commercial alternatives to ARToolKit are Qualcomm Vuforia, Total Immersion D'Fusion, and Wikitude. Complex 3D models can be rendered with the VCG library, developed at ISTI-CNR in Pisa, or with libGDX. People with a basic knowledge of machine vision might want to use the OpenCV library, which contains implementations of many basic algorithms. All the images in this chapter were produced by proprietary source code that uses the OpenCV and VCG libraries. Point clouds can be built using, among others, Bundler (Snavely et al. 2006) or Microsoft Photosynth.