Abstract
The emergence of cheap structured light sensors, like the Kinect, opened the door to an increased interest in all matters related to the processing of 3D visual data. Applications for these technologies are abundant, from robot vision to 3D scanning. In this paper we go through the main steps used on a typical 3D vision system, from sensors and point clouds up to understanding the scene contents, including key point detectors, descriptors, set distances, object recognition and tracking and the biological motivation for some of these methods. We present several approaches developed at our lab and some current challenges.
L.A. Alexandre—This work was partially financed by FEDER funds through the Programa Operacional Factores de Competitividade - COMPETE and by Portuguese funds through FCT - Fundação para a Ciência e a Tecnologia in the framework of the project PTDC/EIA-EIA/119004/2010.
1 Introduction
There are currently many application fields for 3D computer vision (3DCV). One of the recent pushes to 3DCV came with the appearance of cheap 3D sensors, such as the Microsoft Kinect. The Kinect was not developed for 3D computer vision but for the (console) video gaming industry, where 3DCV is used as a means to receive user input. Other applications of 3DCV can be found in biometrics, such as 3D face and expression recognition, in robot vision, in industrial quality control systems, and even in online shopping (see Note 1).
We present the current 3D technologies and the most used sensors in Sect. 2. In Sect. 3 the focus will be on keypoint extraction from 3D point clouds. Section 4 discusses 3D descriptors and the following section presents methods used on 3D object recognition. Section 6 presents a 3D tracking method based on keypoint extraction and Sect. 7 indicates some current challenges in this field. The final section contains the conclusion.
2 3D Sensors
There are several possible technologies for obtaining 3D images. These 3D images are in fact sets of points in space called point clouds. Besides their 3D coordinates, these points typically have at least a gray scale or RGB value, but can have other measures associated, such as local curvature. A 3D image can also be represented by two 2D images: one containing the illumination intensity or color values of the scene locations, and the other the respective depth, i.e., the distance to the sensor.
A basic approach to obtaining 3D images is to infer depth from two different views of a scene (parallax). This can be done by using a single camera and positioning it in different locations (for a static scene) or, more commonly, by using two cameras, mimicking the layout of animals' visual sensors (eyes), as in Fig. 1. The major difficulty in this approach is identifying the same scene point in both images to obtain the point disparity. Many approaches have been proposed to achieve this (see Note 2).
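Once the disparity of a scene point is known, its depth follows from triangulation: depth is the focal length times the camera baseline divided by the disparity. A minimal sketch; the focal length and baseline values below are illustrative assumptions (roughly Kinect-like numbers), not figures from the text:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate depth from stereo disparity.

    disparity_px: horizontal pixel offset of the same scene point
    between the left and right images; focal_px: focal length in
    pixels; baseline_m: distance between the two cameras in metres.
    """
    if disparity_px <= 0:
        raise ValueError("point not matched or at infinity")
    return focal_px * baseline_m / disparity_px

# A point with 40 px disparity, seen by a rig with a 6 cm baseline
# and a 580 px focal length, lies about 0.87 m away.
z = depth_from_disparity(40.0, 580.0, 0.06)
```

Note the inverse relation: depth resolution degrades quadratically with distance, which is why stereo rigs struggle with far-away scene points.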
Another way to obtain 3D visual data is to use active vision and project a pattern onto the scene that is used to identify the scene points' relative positions. This is called a structured light approach. Figure 2 presents the idea and shows several sensors based on this approach. The pattern projection is usually made using infrared light so that it doesn't appear in the visible image.

A third approach to obtaining 3D images is to infer each scene point's distance to the sensor by measuring the time light takes to travel from an emitter located near the sensor to the scene point and back to the sensor. Since the speed of light in air is known, the time taken is enough to infer the distance, or depth. Figure 3 illustrates this and presents some commercially available sensors based on this idea.
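The time-of-flight computation described above reduces to halving the round trip: distance is the speed of light times half the measured travel time. A small sketch with an illustrative round-trip time:

```python
C = 299_792_458.0  # speed of light in m/s (vacuum; air is close enough here)

def distance_from_round_trip(t_seconds):
    """Half the round-trip time times the speed of light gives the
    emitter-to-scene-point distance."""
    return C * t_seconds / 2.0

# A 10 ns round trip corresponds to roughly 1.5 m.
d = distance_from_round_trip(10e-9)
```

The nanosecond scale of these times is what makes time-of-flight sensors electronically demanding: a 1 cm depth error corresponds to roughly 67 ps of timing error.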
Size and weight have been falling to the point of currently having a 3D sensor inside a cell phone (see project Tango by Google), something that opens the way to many possible new mobile applications.
These sensors eventually produce a point cloud, typically at 30 fps. For 30 k points with RGB at 30 fps (typical Kinect specification), more than 30 MB/s of data are generated. This can be too much data, especially for embedded applications, so some form of sub-sampling must be used to reduce the computational burden of processing this type of data stream.
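The data rate above can be checked with a back-of-envelope computation. The 32-byte per-point figure below is an assumption about the in-memory layout (PCL's `PointXYZRGB` pads XYZ and RGB to 16 bytes each for SSE alignment), not a wire format stated in the text:

```python
# Back-of-envelope estimate for the Kinect-like stream mentioned above.
points_per_cloud = 30_000   # points per frame
bytes_per_point = 32        # assumed PCL PointXYZRGB in-memory layout
fps = 30                    # frames per second

rate_mb_s = points_per_cloud * bytes_per_point * fps / 1e6
# rate_mb_s is about 28.8 MB/s, on the order of the 30 MB/s in the text.
```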
3 Keypoints
Keypoints are a set of points considered representative of the point cloud. They are extracted when the full data stream is too much for real-time processing; keypoint extraction is thus a form of sub-sampling. Figure 4 presents two different approaches to keypoint extraction: regularly spaced sub-sampling using a voxel grid with two different voxel sides (left: 1 cm; center: 2 cm) and the Harris3D extractor (right). The figure also shows the location of the keypoints (the black dots) and the number of extracted keypoints.
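The voxel-grid sub-sampling mentioned above can be sketched in a few lines: partition space into cubes of a given edge length and keep one representative point (here, the centroid) per occupied cube. This is a minimal numpy re-implementation of the idea, not PCL's own `VoxelGrid` filter:

```python
import numpy as np

def voxel_grid_subsample(points, leaf_size):
    """Keep one representative point (the centroid) per occupied voxel.

    points: (N, 3) array of XYZ coordinates; leaf_size: voxel edge
    length in the same units as the coordinates.
    """
    voxel_idx = np.floor(points / leaf_size).astype(np.int64)
    # Group points that fall into the same voxel.
    _, inverse, counts = np.unique(voxel_idx, axis=0,
                                   return_inverse=True, return_counts=True)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)     # accumulate points per voxel
    return sums / counts[:, None]        # one centroid per voxel

cloud = np.array([[0.001, 0.002, 0.0],
                  [0.003, 0.004, 0.0],  # same 1 cm voxel as the first point
                  [0.05,  0.05,  0.0]]) # a different voxel
reduced = voxel_grid_subsample(cloud, leaf_size=0.01)
# reduced has 2 points: the centroid of the first two, plus the third.
```

Doubling the leaf size roughly divides the number of surviving points by up to eight in dense regions, which matches the reduction shown in Fig. 4 between the 1 cm and 2 cm grids.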
Humans don’t process every “input pixel”, but focus their attention on salient points.
We have recently proposed [6] a 3D keypoint detector based on a computational model of the human visual system (HVS): the Biologically Inspired 3D Keypoint based on Bottom-Up Saliency (BIK-BUS). This approach is inspired by visual saliency, and the method mimics the following HVS mechanisms:
- Center-surround cells: excited by stimuli at the center of their receptive fields and inhibited by stimuli in their surroundings.
- Color double-opponency: neurons are excited in the center of their receptive field by one color and inhibited by the opponent color (red-green or blue-yellow), while the opposite takes place in the surround.
- Orientation selectivity: the impulse response of orientation-selective neurons is approximated by Gabor filters.
- Lateral inhibition: neighboring cells inhibit each other through lateral connections.
Figure 5 presents a general view of the proposed method. The input point cloud is filtered to obtain color, intensity and normal orientation data. This is then used to build multi-scale representations of these features (Gaussian pyramids), which are combined using a mechanism that simulates center-surround cells and a normalization operator motivated by lateral inhibition to generate feature maps. From these feature maps, new maps, called conspicuity maps, are generated by combining information from multiple scales. The three conspicuity maps are combined into a single saliency map. Finally, from the saliency map, and through the use of inhibition mechanisms, the 3D keypoints are selected.
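One stage of this pipeline, the center-surround combination, can be sketched as a fine-scale map minus a coarse-scale map, rectified so that only excitatory responses remain. This is a simplified illustration only: box blurs stand in for the Gaussian pyramid levels, and the 2D grid stands in for a projected feature map of the cloud:

```python
import numpy as np

def box_blur(img, k):
    """Crude separable box blur standing in for one Gaussian-pyramid level."""
    kernel = np.ones(k) / k
    img = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, img)

def center_surround(feature, center_k=3, surround_k=9):
    """Center-surround response: a fine ("center") scale minus a coarse
    ("surround") scale, keeping only the positive (excitatory) part."""
    return np.maximum(box_blur(feature, center_k) - box_blur(feature, surround_k), 0.0)

# A single bright spot yields a strong response at its location and a
# suppressed response in its surround, as for an on-center cell.
feature = np.zeros((32, 32))
feature[16, 16] = 1.0
response = center_surround(feature)
```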
We evaluated our proposal against 8 state-of-the-art detectors. We performed around 1.6 million comparisons for each keypoint detector/descriptor pair, for a total of 135 pairs (9 keypoint detectors \(\times \) 15 descriptors). The evaluation considered two metrics: area under the ROC curve (AUC) and decidability (DEC). Table 1 shows the number of times each keypoint detector was the best in the experiments. BIK-BUS was a clear winner, with the second-best methods at a considerable distance.
4 Descriptors
4.1 Evaluating Descriptors
A descriptor is a measure extracted from the input data that represents or describes an input data region in a concise manner. Descriptors allow a system to keep only a condensed representation of the input data (they are the equivalent of features in standard pattern recognition). There is a wide choice of descriptors, so which one should be used? We evaluated 13 descriptors available in PCL [1]. Figure 6 shows the time taken and space used by the evaluated descriptors when they were applied after 3 different keypoint detectors.
Figure 7 shows the Precision-Recall curves for the experiments that used the 1 cm voxel grid sub-sampling keypoint detector. The color-based descriptors (PFHRGB and SHOTCOLOR) perform best. Further details, including the equivalent figures for the remaining 2 keypoint detector approaches, can be found in [1].
4.2 Genetic Algorithm-Evolved 3D Point Cloud Descriptor
From the evaluation of the descriptors discussed in the previous section, we concluded that accurate descriptors are very computationally intensive, while faster descriptors use large storage space. For embedded approaches, such as robot vision, where computational resources and storage space come at a cost or might not be available in adequate amounts, a simple descriptor is desirable. For this type of application, we developed [8] a genetic algorithm (GA)-based descriptor that is both fast and has a small space footprint, while maintaining an acceptable accuracy.
It works by creating a keypoint cloud through sub-sampling with a voxel grid with a leaf size of 2 cm. Two regions around each keypoint are considered: a disk (\(R_1\)) and a ring (\(R_2-R_1\)) (see Fig. 8).
The information stored by the descriptor considers both shape and color around each keypoint. For the shape, the descriptor records the histogram of angles between the normal at the keypoint and the normal at each neighbor in the region. For the color information, a (Hue, Saturation) histogram over all points in each region is stored. The distance between two point clouds represented by this descriptor is calculated as \(d= w \cdot d_{shape} + (1-w) \cdot d_{color} \), where the weight w is obtained through the GA optimization procedure. In total, 5 parameters (#shape bins, #color bins, \(R_1\), \(R_2\), w) are searched using the GA on the training data set. The obtained results can be seen in Table 2. This proposal yields a much faster and lightweight (in terms of space) descriptor, with accuracy comparable to the SHOTCOLOR descriptor, and is thus adequate for situations where the computational cost of algorithms is an issue and/or the available storage space is small.
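The weighted combination of shape and color distances can be sketched directly. The histogram distance below is a placeholder (plain L1 between histograms), since the text does not fix that choice here; the weight `w` would be the value found by the GA search:

```python
import numpy as np

def hist_dist(h1, h2):
    # Placeholder histogram distance (L1); the paper's exact choice
    # may differ.
    return np.abs(h1 - h2).sum()

def descriptor_distance(shape_a, color_a, shape_b, color_b, w):
    """d = w * d_shape + (1 - w) * d_color, as in the GA-evolved descriptor.

    shape_*: normal-angle histograms; color_*: (Hue, Saturation)
    histograms; w: shape/color trade-off found by the GA.
    """
    d_shape = hist_dist(shape_a, shape_b)
    d_color = hist_dist(color_a, color_b)
    return w * d_shape + (1 - w) * d_color

# Identical descriptors are at distance zero regardless of w.
h = np.array([0.5, 0.5])
same = descriptor_distance(h, h, h, h, w=0.7)
```

The GA then only has to search a 5-dimensional space (two bin counts, two radii and `w`), which keeps the optimization tractable on the training set.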
5 3D Object Recognition
The typical 3D object recognition pipeline consists of: obtaining the input data, usually in the form of a point cloud; detecting keypoints; and computing descriptors at each keypoint, which are then grouped into a set that represents the input point cloud. After this, in a test or deployment phase, incoming point clouds are compared against those stored in an object database using, for instance, a set distance.
So, each point cloud is represented by a set of descriptors, and each descriptor is n-dimensional. In practice, a given point cloud can have an arbitrary number of descriptors representing it, so the cardinality of the set of descriptors that represents the input data is not constant. To find the closest object in a database (a match to the input point cloud), we need to use a set distance.
5.1 Set Distances
Set distances are usually built around point distances. Three common point distances are the following: consider \(x,y \in \mathbb {R}^n\), then

-
City-block:
$$ L_1(x,y)= \Vert x-y\Vert _1 =\sum _{i=1}^n |x(i) -y(i)| $$
-
Euclidean:
$$ L_2(x,y)= \Vert x-y\Vert _2 =\sqrt{\sum _{i=1}^n (x(i) -y(i))^2} $$
-
Chi-squared:
$$ d_{\chi ^2}(x,y)= \frac{1}{2}\sum _{i=1}^n \frac{(x(i) -y(i))^2}{x(i) +y(i)} $$
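The three point distances above translate directly into code. The chi-squared version assumes non-negative entries (e.g. histogram bins), and bins that are zero in both vectors are skipped to avoid division by zero:

```python
import numpy as np

def city_block(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return np.abs(x - y).sum()

def euclidean(x, y):
    """L2 distance: square root of the sum of squared differences."""
    return np.sqrt(((x - y) ** 2).sum())

def chi_squared(x, y):
    """Chi-squared distance for non-negative vectors (e.g. histograms)."""
    num = (x - y) ** 2
    den = x + y
    mask = den > 0          # skip bins that are zero in both vectors
    return 0.5 * (num[mask] / den[mask]).sum()

x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 0.0, 0.0])
# city_block -> 3.0; euclidean -> sqrt(5); chi_squared -> 0.5*(1/3 + 4/2) = 7/6
```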
Consider that a, b are points and A, B are sets. Let us also consider the following set distances:
-
\( D_1(A,B)= \max \{ \sup \{ f(a,B) \ | \ a \in A \} , \sup \{ f(b,A) \ | \ b \in B \}\}\) with \( f(a,B)= \inf \{L_1(a,b),\ b \in B \}\)
-
\(D_2\) = Pyramid Match Kernel distance [7]
-
\( D_3(A,B)= L_1(\text {min}_A,\text {min}_B) + L_1(\text {max}_A,\text {max}_B)\) with \( \text {min}_A(i)= \min _{j=1,\ldots ,|A|} \{a_j(i) \},\) \( i=1,\ldots ,n\) \( \text {max}_A(i)= \max _{j=1,\ldots ,|A|} \{a_j(i) \}, \ i=1,\ldots ,n\) and similarly for \(\min _B(i)\) and \(\max _B(i) \).
-
\(D_4(A,B)= L_1(c_A,c_B) \) where \(c_A,c_B\) are cloud centroids
-
\(D_5(A,B)= L_2(c_A,c_B) \)
-
\(D_6(A,B)=D_4(A,B)+L_1(std_A,std_B) \) with \(std_A(i)=\sqrt{\frac{1}{|A|-1}\sum _{j=1}^{|A|} (a_j(i)-c_A(i))^2}, \ i=1,\ldots ,n\ \) and similarly for \(std_B\).
-
\(D_7(A,B)=d_{\chi ^2}(c_A,c_B)+d_{\chi ^2}(std_A,std_B) \)
-
\(D_{8}(A,B)= \frac{1}{|A| |B|}\sum _{i=1}^{|A|} \sum _{j=1}^{|B|} L_1(a_i,b_j) \)
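The statistics-based distances \(D_4\) and \(D_6\) are simple enough to sketch directly: each descriptor set is reduced to its per-dimension centroid (and, for \(D_6\), standard deviation), and the reduced vectors are compared with \(L_1\):

```python
import numpy as np

def d4(A, B):
    """D4: L1 distance between the centroids of two descriptor sets.
    A and B are (|A|, n) and (|B|, n) arrays, one descriptor per row."""
    return np.abs(A.mean(axis=0) - B.mean(axis=0)).sum()

def d6(A, B):
    """D6: D4 plus the L1 distance between per-dimension standard
    deviations (with the |A|-1 normalisation used in the text)."""
    sA = A.std(axis=0, ddof=1)
    sB = B.std(axis=0, ddof=1)
    return d4(A, B) + np.abs(sA - sB).sum()

# Two small 2-D descriptor "sets":
A = np.array([[0.0, 0.0], [2.0, 2.0]])
B = np.array([[1.0, 1.0], [3.0, 3.0]])
# Centroids are (1,1) and (2,2) and the stds are identical,
# so d4(A, B) = 2 and d6(A, B) = 2.
```

Note that both distances cost O((|A| + |B|) n), versus O(|A| |B| n) for the all-pairs distance \(D_8\), which is the source of the speed advantage reported below.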
We evaluated [2] these 8 distances using 2 descriptors (PFHRGB and SHOTCOLOR). We used a data set with 48 objects from 10 categories, totalling 1421 point clouds. The keypoint detector used was Harris3D. Figure 9 shows the precision-recall curves for the experiments with both descriptors. Table 3 contains the time taken to evaluate the test set on a machine running 12 threads.
Simple distances like \(D_6\) and \(D_7\) are a good choice (accurate and fast), better than more common distances such as \(D_1\) and \(D_2\). Additionally, these simple distances don't need any parameter search, unlike \(D_2\).
5.2 Deep Transfer Learning for 3D Object Recognition
Deep learning is showing great potential in pattern recognition. The idea of transfer learning (TL) is also a very appealing one: learn on one problem and reuse (at least part of) the knowledge on other problems. We used both ideas in a work where a convolutional neural network learns to recognize objects from 3D data [3]. TL is used from one color channel to the others and also to the depth channel, and decision fusion is used to merge each net's predictions. The results appear in Table 4. As can be seen, the TL approach is successful in obtaining both higher accuracy and shorter time than the baselines considered.
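The decision-fusion step can be illustrated with a small sketch. Averaging the per-class probability vectors of the per-channel networks is one common fusion rule, used here for illustration; the exact rule in [3] may differ:

```python
import numpy as np

def fuse_predictions(per_net_probs):
    """Average the per-class probability vectors produced by the
    per-channel networks and return the winning class index."""
    fused = np.mean(per_net_probs, axis=0)
    return int(np.argmax(fused)), fused

# Three channel networks (e.g. R, G and depth) produce different
# class-probability vectors; fusion picks class 1.
probs = [np.array([0.6, 0.4]),
         np.array([0.3, 0.7]),
         np.array([0.2, 0.8])]
label, fused = fuse_predictions(probs)
```

Fusion of this kind lets each per-channel net stay small (and thus fast to fine-tune via TL), while still combining evidence from all channels at decision time.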
6 3D Object Tracking
The world is dynamic: another step towards understanding it is to follow objects as they move, since movement is a very important visual cue. There are many different approaches to tracking: the most used are particle filter variants [4].
We used a biologically-inspired keypoint extractor to initialize and maintain particles for particle filter-based tracking from 3D [5]. A general overview of the proposed method appears in Fig. 10.
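The particle-filter core of such a tracker can be sketched as a predict/update/resample cycle over hypothesised 3D object positions. This is a generic textbook sketch, not PFBIK itself; the motion and observation noise values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, observation, motion_std=0.01, obs_std=0.05):
    """One predict/update/resample cycle tracking a 3D centroid.

    particles: (N, 3) hypothesised object positions; observation: the
    measured centroid for the current frame.
    """
    # Predict: diffuse particles with a random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Update: weight each particle by a Gaussian likelihood of the observation.
    err = np.linalg.norm(particles - observation, axis=1)
    weights = np.exp(-0.5 * (err / obs_std) ** 2)
    weights /= weights.sum()
    # Resample: draw particles proportionally to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

# Initialise particles around a first detection, then track a target
# drifting along x; the particle mean should follow it.
particles = rng.normal([0.0, 0.0, 1.0], 0.05, (200, 3))
for t in range(10):
    target = np.array([0.01 * t, 0.0, 1.0])
    particles = particle_filter_step(particles, target)
estimate = particles.mean(axis=0)
```

In the proposed method, the biologically-inspired keypoints replace the generic initialization above (and maintain the particle set), which is what allows far fewer particles than a blind spread over the cloud.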
We compared our tracker against the OpenNI tracker available in PCL. The videos used 10 different moving objects and a total of 3300 point clouds. The results are presented in Table 5. The PFBIK tracker used a much smaller number of particles, which enabled it to be much faster during tracking; the exception is the initialization, where it was slower than the OpenNI tracker due to the required keypoint detection (done only at the start of the tracking process). PFBIK was also slightly more accurate, as can be seen from the distance error to the tracked object's centroid.
7 Challenges
Although recent progress in 3DCV has been substantial, there are still many challenges in the field. Some of the current challenges faced by the 3D computer vision community are:
-
object representation: in this paper we showed object representations based on sets of descriptors of partial object views (2.5D), but other possibilities might be better (representing an object using a fused-view representation, for instance). The best object representation approach may depend on the particular application and is still an important research topic;
-
non-rigid object recognition: the current keypoint-plus-descriptor approach is not a good solution when objects are not rigid. More complex models, such as 3D deformable models, are needed;
-
activity recognition: what are the best approaches to understanding human activities from 3D video? This is currently a hot research topic;
-
real-time processing: GPU-based implementations of most algorithms can get us there, but real time is still a problem on embedded devices (cloud-based processing requires high bandwidth and a permanent connection).
8 Conclusion
This paper summarizes the invited talk presented at ICPRAM 2015, where the author reviewed some of the key concepts of 3D computer vision and presented some of the recent work in this field produced by him and his co-authors.
Notes
- 1.
See http://metail.com/.
- 2.
Check for instance the disparity algorithms at the Middlebury Stereo Vision Page http://vision.middlebury.edu/stereo/.
References
Alexandre, L.A.: 3D descriptors for object and category recognition: a comparative evaluation. In: Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vilamoura, Portugal, October 2012
Alexandre, L.A.: Set distance functions for 3D object recognition. In: Ruiz-Shulcloper, J., Sanniti di Baja, G. (eds.) CIARP 2013, Part I. LNCS, vol. 8258, pp. 57–64. Springer, Heidelberg (2013)
Alexandre, L.A.: 3D object recognition using convolutional neural networks with transfer learning between input channels. In: Menegatti, E., Michael, N., Berns, K., Yamaguchi, H. (eds.) Intelligent Autonomous Systems 13. AISC, pp. 889–898. Springer, Heidelberg (2014)
Del Moral, P.: Mean Field Simulation for Monte Carlo Integration. Chapman and Hall/CRC, Boca Raton (2013)
Filipe, S., Alexandre, L.: PFBIK-tracking: particle filter with bio-inspired keypoints tracking. In: 2014 IEEE Symposium on Computational Intelligence for Multimedia, Signal and Vision Processing (CIMSIVP), pp. 1–8, Florida, USA, December 2014
Filipe, S., Itti, L., Alexandre, L.A.: BIK-BUS: biologically motivated 3D keypoint based on bottom-up saliency. IEEE Trans. Image Process. 24(1), 163–175 (2015)
Grauman, K., Darrell, T.: The pyramid match kernel: efficient learning with sets of features. J. Mach. Learn. Res. 8, 725–760 (2007)
Wȩgrzyn, D., Alexandre, L.A.: A genetic algorithm-evolved 3D point cloud descriptor. In: Ruiz-Shulcloper, J., Sanniti di Baja, G. (eds.) CIARP 2013, Part I. LNCS, vol. 8258, pp. 92–99. Springer, Heidelberg (2013)
© 2015 Springer International Publishing Switzerland
Alexandre, L.A. (2015). 3D Computer Vision: From Points to Concepts. In: Fred, A., De Marsico, M., Figueiredo, M. (eds) Pattern Recognition: Applications and Methods. ICPRAM 2015. Lecture Notes in Computer Science(), vol 9493. Springer, Cham. https://doi.org/10.1007/978-3-319-27677-9_1