
9.1 Introduction

Computer vision has made huge advances since its beginnings in the 1960s. After a slow start, plagued by limited computing power and sometimes overly optimistic predictions (such as implementing a generic vision system over a summer), recent years have seen increasing numbers of real-world applications appearing on the market, from face tracking in consumer digital cameras, driving assistance systems in cars and autonomous vacuum cleaning robots to augmented reality applications and home entertainment. Of course, industrial machine vision confined to the clearly structured environments of factory floors and assembly lines, or medical imaging applications with a human in the loop, has been on the market far longer. But within the scope of this article, we are interested in computer vision as it was seen by its early proponents, as exemplified by Roberts (1965), Binford (1971), Clowes (1971), Huffman (1971), Waltz (1975), Nevatia and Binford (1977), Marr (1982), Biederman (1987): to understand the computational principles that allow human or animal vision to seemingly arrive at generic scene interpretations from images. Or, put another way, vision that serves an agent to operate in and interact with the unconstrained real three-dimensional world.

This is of course a very broad definition and encompasses many different abilities related to locomotion, manipulation, learning, recognition, or social interactions, as has been emphasised by Sloman (1989). The many tasks that vision thus has to fulfil go beyond merely reconstructing and interpreting a three-dimensional scene. Many intermediate or very specific results of visual processing serve as input to, e.g., motor control, or influence affective states. Part of the recent success of vision, apart from increased computing power and the availability of new mathematical tools, is due to a high degree of specialisation of solutions in each of these areas.

For many of these specialised solutions impressive videos can be watched online, and one is left wondering “Ok, done! So, where is the problem?”. Yet the performance of robots at competitions like the Semantic Robot Vision Challenge or RoboCup@Home, while clearly progressing from year to year, shows that these parts do not yet make a whole. There are of course many more problems to be solved in a complete robotic system besides vision, such as issues of power consumption and dexterity in manipulation, but limitations in perception, and most notably vision, typically play a central role. So something must still be missing.

In the next section, we will review a selection of state-of-the-art solutions in some of the respective areas, showing which impressive things computer vision can in fact already do. Section 9.3 will then try to identify some fundamental problems in the current approach to computer vision, followed by suggestions on where future research could advance on a broader basis in Sect. 9.4.

9.2 What Vision Can Do

The following selection of work is not intended as a genuine review of work in different areas of computer vision, but rather to highlight some state-of-the-art solutions that taken together could seem to have solved computer vision for robotics. So, what can vision do for robotics?

9.2.1 Navigation, Localisation

Simultaneous localisation and mapping (SLAM) has been addressed by the robotics community early on, starting with ultrasonic and later laser range sensors, where it can essentially be considered solved (Thrun et al. 2005). With increasing computational power and mathematical tools such as sparse bundle adjustment (Lourakis and Argyros 2009), vision-based methods (Visual SLAM) began to replace laser-based ones, e.g. (Nistér et al. 2006; Davison et al. 2007), and can now handle very large areas (Cummins and Newman 2010).
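To make the role of bundle adjustment a little more concrete, the following is a minimal, illustrative sketch (not the cited systems' implementation) of the reprojection-error residual that sparse bundle adjustment minimises, assuming a simple pinhole camera with a single focal length and using SciPy for the non-linear least-squares refinement; all function and variable names are our own.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

# Toy bundle adjustment residual: camera poses (rotation vector + translation)
# and 3D points are refined by minimising reprojection error. Real Visual SLAM
# systems exploit the sparsity of the Jacobian ("sparse" bundle adjustment).

def project(points3d, rvec, tvec, f):
    """Pinhole projection of (N, 3) world points into one camera."""
    R = Rotation.from_rotvec(rvec).as_matrix()
    cam = points3d @ R.T + tvec            # world -> camera coordinates
    return f * cam[:, :2] / cam[:, 2:3]    # perspective division

def residuals(params, n_cams, n_pts, cam_idx, pt_idx, observed, f):
    """Stacked 2D reprojection errors for all (camera, point) observations."""
    poses = params[:n_cams * 6].reshape(n_cams, 6)   # [rvec | tvec] per camera
    pts3d = params[n_cams * 6:].reshape(n_pts, 3)
    proj = np.vstack([project(pts3d[j:j + 1], poses[i, :3], poses[i, 3:], f)
                      for i, j in zip(cam_idx, pt_idx)])
    return (proj - observed).ravel()

# Given an initial guess x0 (from feature matching and pose estimation),
# least_squares(residuals, x0, args=(n_cams, n_pts, cam_idx, pt_idx, obs, f))
# refines all poses and points jointly.
```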

These methods rely on the robust extraction of uniquely identifiable image regions, for which a variety of image features have been proposed, such as MSER (Matas et al. 2002), SIFT (Lowe 2004), FAST (Rosten and Drummond 2006), SURF (Bay et al. 2008) or DAISY (Tola et al. 2008). These features, in general, play an essential role in many modern computer vision approaches, from SLAM and structure from motion to object recognition and tracking.
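As a concrete illustration of such features, the following minimal sketch detects and matches interest points between two images using OpenCV's ORB detector as a freely available stand-in for the detectors listed above; the image filenames are hypothetical.

```python
import cv2

# Minimal sketch of interest-point detection and matching, the building block
# shared by Visual SLAM, structure from motion and instance recognition.
img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)  # hypothetical filenames
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)   # keypoints + binary descriptors
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with cross-check; ratio tests or FLANN are common too.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

print(f"{len(matches)} putative correspondences between the two views")
```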

9.2.2 3D Reconstruction

Using similar techniques, structure from motion (SfM) approaches place the emphasis on dense reconstruction of the scene rather than on navigation based on sparse landmarks. Microsoft’s Photo Tourism (Snavely et al. 2006) is quite well known: it can reconstruct in very high detail a building such as the cathedral of Notre Dame from a collection of thousands of photographs taken from the web. Going even further, Agarwal et al. (2009) scale the approach to entire cities, although that does take, as the title of their paper suggests, a day of computation on a cluster of 500 compute cores.

Real-time solutions are available for smaller-scale scenes. The approach by Klein and Murray (2007) uses a parallel processing pipeline, highly optimised for today’s multi-core machines, to build a semi-dense map of the environment based on tracking distinctive image features. Building on that, the work by Newcombe and Davison (2010) fills in the details using GPU-based optical flow computation (Zach et al. 2007) to arrive at a dense 3D scene reconstruction with visually very pleasing results.
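To illustrate the kind of per-pixel motion information such dense approaches build on, here is a minimal sketch using OpenCV's Farnebäck optical flow as a readily available CPU stand-in for the GPU-based flow of Zach et al. (2007); the frame filenames are hypothetical.

```python
import cv2
import numpy as np

# Sketch of dense optical flow between two frames: per-pixel displacement
# vectors which, together with known camera motion, constrain per-pixel depth
# and thus a dense reconstruction.
prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)  # hypothetical files
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=4, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("mean flow magnitude:", float(np.mean(magnitude)))
```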

9.2.3 Scene Segmentation

The above approaches reconstruct the scene as a whole, essentially treating it as a single rigid and static object. Multibody structure from motion approaches (Fitzgibbon and Zisserman 2000; Ozden et al. 2010) observe a dynamic scene and segment it into independently moving rigid objects.

Given only a static scene, Rusu et al. (2009) segment a 3D point cloud, as provided by stereo or depth sensors, into parametric object models such as planes, spheres, cylinders, and cones. Similarly, Biegelbauer et al. (2010) fit superquadrics to point clouds to cover a wider range of parametric shapes. Using a strong prior model of the 3D scene and, again, parametric object models, the approach of Hager and Wegbreit (2011) is able to handle scenes exhibiting complex support and occlusion relations between objects, and also reasons explicitly about dynamic changes of the scene such as objects being moved, added or removed.
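As an illustration of fitting parametric models to point clouds, the following is a minimal RANSAC plane-fitting sketch in the spirit of such approaches (not the cited implementations); real systems additionally use surface normals, clustering and further primitive types.

```python
import numpy as np

def ransac_plane(points, n_iters=500, threshold=0.01, rng=None):
    """Fit a dominant plane to an (N, 3) point cloud with RANSAC.
    Returns (normal, d) of the plane n.x + d = 0 and the inlier mask."""
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ sample[0]
        dist = np.abs(points @ normal + d)   # point-to-plane distances
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane, best_inliers

# Repeatedly extracting the dominant plane and removing its inliers yields the
# table top and other large surfaces; remaining clusters are object candidates.
```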

Taking a more active approach, Björkman and Kragic (2010) combine wide-angle and foveated stereo to segment 3D objects of arbitrary shape standing isolated on a supporting surface. Even more actively, Fitzpatrick and Metta (2003) use a robot manipulator to poke parts of the scene, exploiting the resulting motion in 2D image sequences to segment objects.

Recent advances in 3D sensing, most notably the Microsoft Kinect RGB-D sensor, have brought a renewed interest in 3D methods. Having (close to) veridical depth perception simplifies the segmentation problem and allows quite cluttered indoor scenes to be segmented in real time (Ückermann et al. 2012) into objects described in terms of parametric surface models (Richtsfeld et al. 2012).

9.2.4 Recognition

Object recognition is of course a central theme in computer vision, especially in the context of robotics. Early attempts at generic recognition of 3D solids (Binford 1971; Waltz 1975; Nevatia and Binford 1977; Marr and Nishihara 1978; Brooks 1983; Biederman 1987; Lowe 1987; Dickinson et al. 1992), often based on edge features, tended to suffer from scene complexity and textured surfaces. With the advent of invariant interest point detectors (Mikolajczyk and Schmid 2004) and the strongly distinctive point descriptors mentioned above (Matas et al. 2002; Lowe 2004; Rosten and Drummond 2006; Bay et al. 2008; Tola et al. 2008), appearance-based recognition of arbitrarily shaped object instances in highly cluttered real world environments was essentially solved, e.g. (Lowe 1999; Gordon and Lowe 2006; Ferrari et al. 2006; Özuysal et al. 2007; Collet et al. 2009; Mörwald et al. 2010), even for non-rigid objects such as clothing (Pilet et al. 2007)—provided of course that the respective objects are textured. Making use of a combination of colour image and dense depth map, the fast template-based approach by Hinterstoisser et al. (2011) also detects untextured objects in heavy clutter at close to frame rate.
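As a rough illustration of feature-based instance recognition, the following sketch matches descriptors between a stored model view and a scene image and verifies them geometrically with a RANSAC homography (adequate for roughly planar, textured object faces); it is a simplified stand-in for the cited methods, and the filenames are hypothetical.

```python
import cv2
import numpy as np

# Sketch of appearance-based instance recognition: match features between a
# stored model view and the current image, then verify geometrically.
model = cv2.imread("model_view.png", cv2.IMREAD_GRAYSCALE)  # hypothetical files
scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp_m, des_m = orb.detectAndCompute(model, None)
kp_s, des_s = orb.detectAndCompute(scene, None)

matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_m, des_s)

if len(matches) >= 10:
    src = np.float32([kp_m[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_s[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    print("object detected" if mask is not None and mask.sum() >= 10
          else "not enough geometrically consistent matches")
```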

The above appearance-based methods are intrinsically suited to detecting individual object instances with specific surface markings. Going beyond single instances, approaches exist (Fei-Fei et al. 2006; Leibe and Schiele 2003; Dalal and Triggs 2005) that detect object categories, including deformable ones such as cows or walking humans.

9.2.5 Online Learning

Acquiring models for the above recognition methods often involves hand-labelling of images or placing objects on turntables as an offline learning step, which is clearly not desirable for an agent supposed to act autonomously in the world.

Various online learning methods have been proposed, such as that of Özuysal et al. (2006), which keeps “harvesting” additional features as it tracks the model acquired so far. The ProFORMA system (Pan et al. 2009) even reconstructs high-quality dense triangle meshes while tracking a model, and also suggests new views to add.

Going further in the direction of a complete system, Kraft et al. (2008) and Welke et al. (2010) let a robot pick up and rotate objects in its hand to actively cover all views of an object.

9.2.6 Tracking

Much like recognition, model-based 3D object tracking has been well covered in computer vision (Lepetit and Fua 2005). Especially with the availability of cheap and powerful graphics cards, computationally heavy methods such as particle filtering (Klein and Murray 2006) have been rendered real-time (Chestnutt et al. 2007; Murphy-Chutorian and Trivedi 2008; Sánchez et al. 2010; Choi and Christensen 2010; Mörwald et al. 2011) and allow tracking of complex 3D objects through heavy clutter.
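To make the particle-filtering idea concrete, here is a minimal sketch of the predict-update-resample recursion for 6-DoF pose tracking; the observation likelihood is a placeholder, whereas the cited trackers render the object model (often on the GPU) and score it against image edges or colour statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                                    # number of particles
particles = np.zeros((N, 6))               # [x, y, z, roll, pitch, yaw]
weights = np.full(N, 1.0 / N)

def likelihood(pose, image):
    """Placeholder: score how well a rendered model at `pose` matches `image`."""
    return np.exp(-0.5 * np.sum(pose[:3] ** 2))   # dummy score for the sketch

def track_step(particles, weights, image, motion_noise=0.01):
    # 1. Predict: diffuse particles with a simple random-walk motion model.
    particles = particles + rng.normal(0.0, motion_noise, particles.shape)
    # 2. Update: reweight by observation likelihood and normalise.
    weights = weights * np.array([likelihood(p, image) for p in particles])
    weights /= weights.sum()
    # 3. Resample when the effective sample size drops too low.
    if 1.0 / np.sum(weights ** 2) < len(particles) / 2:
        idx = rng.choice(len(particles), len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

# The pose estimate is the weighted mean of the particles, e.g.
# pose = np.average(particles, axis=0, weights=weights)
```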

So much for an (incomplete) overview of some of the success stories of computer vision in the realm of robotics. Next we will look at where we stand with this and why service robots are not yet scurrying around in our apartments.

9.3 What Vision Can’t Do

What vision can’t do is simply to allow a robot to operate in and interact with the unconstrained real three-dimensional world, as was our stated goal in the introduction.

9.3.1 Abstraction

One of the reasons why this is the case is explored in the very comprehensive review by Dickinson (2009). The author there sums up the evolution of object categorisation over the past four decades as different attempts to bridge the large representational gap between the raw input image at the lowest level of abstraction and 3D, viewpoint-invariant, categorical shape models at the highest level of abstraction. In the 1970s, this gap was closed by using idealised images of textureless objects in controlled lighting to extract quite generic shape models. In the 1980s, the images could become more complex, by sacrificing model generality and searching for specific 3D shapes, thus effectively closing the gap at a lower level. Methods of the 1990s allowed recognition of complex textured objects in cluttered scenes; however, objects were now essentially 2D appearance models of specific instances (and even views), thus closing the gap very low at the image level. The feature-based methods of the 2000s allowed recognition of arbitrary 3D object instances in very cluttered environments, while also slowly extending generality back up towards object categories.

So, much of the success of vision was bought by sacrificing generality and the ability of abstraction. This is less of a problem for navigation, where anything is an obstacle or a landmark, but more so for purposeful interaction with specific parts of the scene, viz. objects. Learning each object individually, or perhaps narrow categories of objects, is not feasible in the long run and does not provide the ability to make sense of a given scene when no similar scene has been encountered before. Humans, say an Inuit seeing tropical jungle for the first time, have no problem perceiving a completely unfamiliar scene in terms of complete 3D shapes plus their possibilities of interaction, rather than an assortment of object and category labels. Otherwise that Inuit would have to appear essentially cortically blind, having no categories for all the different tropical trees and bushes.

9.3.2 Putting it Together

Another reason as to why the assorted successes of vision do not yet comprise a unified solution for robotic vision lies in the difficulties of merging these specialised solutions under one framework. One of the difficulties is robustness. Many methods rely on tuning of parameters or some hidden implicit assumptions. Operating these methods outside their safe zones can make them fail abruptly rather than degrade gracefully, leaving a system comprised of many such isolated solutions extremely brittle. A more severe problem actually lies in bridging the semantic gaps between different methods. What does it mean if an object recogniser reports a confidence of \(0.4\) of detecting an object right inside a wall while robot localisation reports an uncertainty of \(40\,\mathrm{{cm}}\)? Approaches like Hoiem et al. (2006) have started exploring the interplay between, e.g. object recognition and estimation of coarse 3D scene geometry, and the work by Hager and Wegbreit (2011) mentioned above explicitly reasons about support and occlusion relations between objects. But a more generic solution of integrating the semantics (together with uncertainties) of individual processing results still seems far off.

9.3.3 Dealing with Failure

A third, somewhat related reason is that researchers in individual specialised sub-fields (quite naturally) strive for perfection, inching recognition rates on standardised benchmarks ever higher in increments of \(0.5\,\%\), while from a systems perspective it makes more sense to accept the inevitable uncertainties and failure modes and reason explicitly about them. This of course hinges on having a common framework, as explained above, in which to meaningfully express these uncertainties. Even more importantly, however, it requires researchers to accept that perfection is futile.

9.4 What Vision Should Do

So what should be done to alleviate the above problems? There is of course no simple answer to this. But let us first look at some of the apparent solutions.

9.4.1 It Isn’t 3D

With the availability of cheap and powerful 3D sensors, such as complete stereo solutions by companies like Point Grey or Videre Design, or depth sensors such as the Mesa Imaging Swissranger or Microsoft Kinect, one important part of vision seems to have been solved, namely reconstructing a 3D scene. There is more to it though, as humans do not perceive a scene as a sort of flip-up cardboard diorama with missing object back sides. Reasoning about occluded parts of the scene as well as generic segmentation into individual objects remains to be solved.

More importantly however, a quick test on yourself by closing one eye will reveal that 3D sensing is not all that important for human vision. Picking up a small object or putting a key into a lock might require several attempts, so clearly direct perception of distance via stereo is an advantage for close-range manipulation, such as using tools or grasping branches when swinging from tree to tree. But the vivid impression of being situated in a 3D scene does not suffer significantly when one is deprived of stereo vision. Also, many grazing animals tend to have non-overlapping fields of view of the left and right eye, as a large field of view (to notice approaching predators) is more important than accurate 3D perception.

For various reasons cups and cows are prominent example objects in computer vision. Advocates of 3D computer vision will point out that, given a cup with a picture of a cow printed on it, a 2D recogniser would be likely to recognise a (nicely textured) cow rather than a (probably untextured) porcelain cup, whereas a 3D shape-based recogniser would correctly identify the cup. However, given a 2D image of a cup with a cow on it, humans have no problem recognising both the cup and the fact that there is a picture of a cow printed on it.

We are not arguing that 3D sensing is not a powerful cue, and in fact robotics is likely to benefit a lot from depth sensors in the near future. But 3D sensing does not seem to be essential for perceiving a 3D scene. Human vision has developed powerful computational mechanisms to infer a complete 3D scene from rather limited information. And these mechanisms are more important than any specific sensing modality.

9.4.2 It Isn’t Resolution

In a similar vein, image resolution does not seem to be critical. Certainly nature has evolved foveated vision for a good reason. The combination of attentional mechanisms based on low-resolution cues with saccades to salient image regions to be processed at high resolution is an important mechanism to optimise visual processing and keep the amount of information tractable. Likewise, any computer vision task benefits significantly from the object of interest being shown large and centred in the image, rather than occupying a small image region somewhere in the scene. However, humans looking at a low-resolution, say \(640 \times 480\), image of a scene typically have no problem interpreting it correctly (otherwise watching TV would be rather confusing).

Moreover, experiments with rapid serial visual presentation (Thorpe and Imbert 1989) have shown that humans are remarkably good at identifying objects and scenes at presentation times well below 200 ms, which leaves no time to perform any saccades. For example in Intraub (1981) subjects were able to identify pictures of a category (‘look for a butterfly’), superordinate category (‘look for an animal’) or negative category (‘look for a picture that is not of house furnishings and decorations’) presented for only 114 ms.

So the human visual system can perform (at least a significant deal of) its processing within an instant, without needing to scan the image with the high-resolution fovea.

9.4.3 It Isn’t (Just) Bayes

There is no doubt that much of the progress in vision is owed to the adoption of probabilistic frameworks over crisp symbolic methods, which are typically too brittle when confronted with the cluttered, uncertain, ambiguous real world. However, sometimes the actual probabilities at the end of some lengthy mathematical argument are rather ad hoc, say the number of matching edgels divided by the total number of edgels, or rest on assumptions about uniform priors. The respective approaches still work fine, thanks to the extraordinary robustness of statistical methods. But the results from different processing modules, although supposedly derived within the same mathematical framework, become difficult to compare to each other within a common system. Just using probabilities is not enough. Care has to be taken that they refer in the same way to the same underlying causes.

A different way to treat uncertainties, rather than aiming for precise 3D estimates plus a measure of remaining uncertainty, could be to not aim for exactness in the first place. Instead one could use more qualitative measures, such as surface A is behind surface B, an observation that can be established with high certainty over a wide range of actual distances. This sort of information might be sufficiently accurate for many types of actions, such as reaching for A. However, it is not clear whether the mathematics for this kind of reasoning over a complex 3D scene would actually turn out to be simpler than more traditional probability theory.
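A tiny numerical illustration of this point, under assumed noise values of our own choosing: the qualitative relation "A is behind B" can be read off reliably from noisy depth estimates whose metric values are individually quite uncertain.

```python
import numpy as np

# Toy illustration: the qualitative relation "surface A is behind surface B"
# is robust even when the metric depth estimates themselves are noisy.
rng = np.random.default_rng(1)
true_depth_A, true_depth_B = 1.8, 1.2            # metres (A really is behind B)
sigma = 0.15                                     # assumed per-measurement noise

obs_A = true_depth_A + rng.normal(0, sigma, 50)  # repeated noisy observations
obs_B = true_depth_B + rng.normal(0, sigma, 50)

p_behind = np.mean(obs_A[:, None] > obs_B[None, :])   # fraction of pairs A > B
print(f"metric estimate of A: {obs_A.mean():.2f} m +/- {obs_A.std():.2f} m")
print(f"P(A behind B) ~ {p_behind:.2f}")               # close to 1 despite noise
```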

9.4.4 Back to the Roots?

Armed with the lessons learned along the way (and with considerably increased computing power), it might be worth reconsidering some of the early approaches to computer vision. These in general aimed at reconstructing a 3D scene from very impoverished visual information, such as edge images only. While simply taking a depth sensor would certainly provide a more direct route, attacking the harder problem, with all the modern mathematical machinery, is still worthwhile. Partly because advances there are more likely to shed light on the computational principles underlying human vision. But also because, as pointed out above, the problem of dealing with incomplete and ambiguous information persists, no matter how rich the underlying sensory information. This problem can be pushed back a little by more advanced sensors, but not avoided altogether.

9.4.5 A Conjecture: Vision as Prediction

In the following, we will put forward a conjecture of what might be one of the computational principles underlying human vision, based on an anecdotal example of severely impoverished visual information.

I enter a room at night, put a glass of water on a table (Fig. 9.1a), walk back to the door to switch off the lights, and the room becomes almost completely dark. I walk back to the table and cannot actually see the glass or anything else on the table, or even the table itself. Scene reconstruction in this case is simply hopeless. Still I expect the glass to be at the position where I left it, and I can very roughly estimate that position by backtracking my steps. So I turn my head this way and that, looking towards a window which is slightly illuminated from outside, until I can see a glint typical of glass surfaces near the expected position (Fig. 9.1b). I reach out (carefully, as I might still collide with other unseen objects on the table) and successfully grab the glass (relying of course heavily on tactile feedback). By no means could I have reconstructed the glass with its 3D shape in that case, yet still I did ‘see’ it. Or rather, I saw something I expected to see given that the glass was there.

Vision as a process of reconstructing the 3D scene is an ill-posed problem, yet humans seem to solve it effortlessly. Still, there are enough everyday cases where scene reconstruction becomes quite impossible even for humans (e.g. in very low light situations). Humans can nevertheless still employ vision successfully in such cases; after all, it was vision that eventually allowed the detection of the glass in the darkened-room example above. Only a little visual information was needed to confirm a hypothesis about the scene.

Fig. 9.1 Glass on a table in normal viewing conditions (a) and in very low light (b). Reconstructing the scene in the latter case or detecting the glass using a geometrical model is very hard, even for humans. Confirming a given hypothesis from a few reflections on the glass surface, however, is far easier and suffices to locate the glass (images best viewed in colour)

The point is that while reconstruction is a notoriously difficult problem, the inverse problem, prediction, is often very simple. One avenue of progress might thus be to view vision (at least in part) as a prediction problem, based on strong priors. A general framework should be able to incorporate multiple cues (visual and possibly non-visual), where the appearance of each cue (such as edges, shadows, highlights) is predicted given a scene hypothesis. Predictions and actual observations could then be used in a Bayesian filter to update an estimate of the scene.
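As a toy illustration of this prediction-based view (with made-up probabilities, not a model of the actual scene in Fig. 9.1), a scene hypothesis predicts that a glint should appear near the remembered glass position, and each confirmed or missing glint updates the belief in that hypothesis via Bayes' rule.

```python
import numpy as np

# Sketch of "vision as prediction": a scene hypothesis predicts where a cue
# (here, a glint on the glass) should appear; observing or not observing the
# cue updates the belief in the hypothesis. All probabilities are illustrative.

def update_belief(prior, cue_found,
                  p_cue_if_present=0.7,     # glint visible if the glass is there
                  p_cue_if_absent=0.05):    # spurious glint otherwise
    """One Bayesian update of P(glass at predicted position | cue observation)."""
    if cue_found:
        num = p_cue_if_present * prior
        den = num + p_cue_if_absent * (1.0 - prior)
    else:
        num = (1.0 - p_cue_if_present) * prior
        den = num + (1.0 - p_cue_if_absent) * (1.0 - prior)
    return num / den

belief = 0.6                       # prior from memory and backtracked ego-motion
for cue in [False, True, True]:    # glints found while turning the head
    belief = update_belief(belief, cue)
    print(f"belief after cue={cue}: {belief:.2f}")
```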

This is just a rough sketch of course. But maybe observing human performance in such visually challenging situations can point the way to technical solutions that degrade equally gracefully.

9.5 Conclusion

Sloman (1978) quite correctly predicted that by the end of the twentieth century (and it is just as true more than a decade later) computer vision would not have progressed enough to be adequate for the design of general purpose domestic robots, and that only specialised machines (with specialised abilities) would be available. This is indeed the state computer vision is in today.

The power of biological vision systems seems not to lie in perfect sensors and processing results, but in dealing with imperfect ones. Failures, uncertainties and ambiguities are not exceptional states of an otherwise perfectly functioning system, but instead part of the normal flow of processing.

We should thus aim at understanding the powerful computational principles allowing biological vision to infer a sufficiently accurate model of reality from partial, ambiguous, sometimes erroneous information derived from various cues. This is actually already happening within many of the approaches presented above, in the form of probabilistic models, though not yet on a system-wide level.

Bringing the individual successful pieces of vision together into an equally successful system that eventually allows robots to operate within the challenging environments of our apartments remains an ambitious goal.